CIRCUIT ANALYSIS COMPUTING
BY WAVEFORM RELAXATION

Conventional exact circuit simulation algorithms, as they are
implemented in SPICE (1) and ASTAP (2) and in the follow-on
programs, are limited by excessive compute time for the
time domain analysis. The difference between the number of
transistors that can be simulated and the number of transistors
in a very-large-scale integrated (VLSI) circuit is an
ever-increasing quantity. For large circuits, the compute time
increases superlinearly with n, as a power law whose exponent
depends on the circuit under analysis, where n is the number of
circuit nodes. This has led to new approaches for the solution
of these problems. The waveform-relaxation method (WR), which
is such a technique, is an iterative approach for the exact
solution of large VLSI circuits in the time domain.

The waveform relaxation technique, as it is presented in
this article, is intended to be as accurate as the standard
SPICE program (1). Often, however, in VLSI design,
simulators are used to obtain solutions where accuracy
is sacrificed for speed. These techniques are not considered in
this article. However, three examples of such algorithms can
be found in the references. The ITA algorithm (3) is based on
time-point relaxation, whereas the SPECS algorithm (4) uses
piecewise constant waveforms. Another technique that uses
piecewise linear waveform approximations together with a
highly damped explicit integration scheme is implemented in
ACES (5).

A waveform method was first applied to VLSI circuit anal- 
ysis problems in 1980. The starting point was the one-way 
circuit analysis formulation by Ruehli, Sangiovanni-Vincen- 
telli, and Rabbat in 1980 (6), which ignored the gate-to-drain 
capacitive feedback in metal oxide semiconductor (MOS)
transistors. Subsequently, the WR process for circuit simulation
was invented by Lelarasmee, Ruehli, and Sangiovanni-Vincentelli
to address this shortcoming (7). Different versions
of WR-based circuit solvers were first developed at the Uni- 
versity of California at Berkeley (8) and at IBM Research 
Laboratories (9) and later at several other locations. Numer- 
ous improvements and new applications have been discovered 
by the engineering and mathematical communities. On the 
mathematical aspects, Miekkala and Nevanlinna and Odeh 
(10) contributed much early on to the understanding of the 
convergence issues of WR. 

Researchers now apply the WR approach to a wide range 
of problems from semiconductor device calculations (11), to 
nonlinear parabolic problems (12), and to multibody problems 
(13). In this article we will give only a limited set of references 
that highlight key advances in WR for both scalar and
multiprocessor machines. A complete set of references up to 1986
is given in Ref. 14. A large chapter in Burrage's book (15)
is dedicated to the application of WR to mostly homogeneous 
problems such as boundary value problems. It includes an ex- 
tensive set of references on more recent WR work. The termi- 
nology homogeneous and heterogeneous is actually due to 
Gear. Homogeneous means problems that can be described by 
a single set of equations in which the domains have relatively 
uniform properties. The solution efficiency for homogeneous 
problems is a very strong function of the basic WR algorithm,
which hereinafter will be referred to as the internal algorithm.

However, the focus of this article is on the heterogeneous 
VLSI circuit analysis problem. Heterogeneous problems con- 
sist of a multitude of different aspects such as linear and non- 
linear parts. All these parts may have a mixture of different 
models embedded such as the conventional macromodels rep- 
resenting semiconductor devices. It is clear that a simple solu- 
tion technique will be very inefficient for these problems. For 
the WR approach to be efficient, the internal algorithms must
be embedded in another layer, which we call the external
algorithms.

The external WR solution algorithm can be characterized 
by the following steps: 

1. Partitioning of a circuit into small subcircuits 

2. Ordering of subcircuits 

3. Scheduling of subcircuits for analysis 

4. WR iteration until convergence of waveforms 

5. Storing of waveforms in database 

Before presenting the WR algorithms, it is appropriate to 
give some insight into the fundamental reasons why WR can 
be faster than a conventional time point circuit solver. Here, 
we assume that the circuits are sufficiently large that they 
can be partitioned into a reasonable number of subcircuits. 
First, in WR a large number of small matrices are solved 
rather than a single large one. For the usual modified nodal 
analysis circuit formulation (MNA) (16), the size of the matrix 
is driven by the number of nodes in a circuit. The average
solution time grows superlinearly with the number of nodes for
a circuit solver using sparse matrix techniques. It is obvious
that the speedup due
to matrix partitioning increases as the number of subcircuits 
increases, which is generally an increasing function of circuit 
size. This is obviously one of the factors why WR is fast for 
very large circuits. Also, each matrix can be solved using dif- 
ferent time steps. The fact that the time steps in the subcircu- 
its are different is called the multirate factor. The evidence 
that in large circuits the waveforms are most likely very dif- 
ferent in different parts of a circuit points to the fact that this 
multirate behavior is another factor that increases strongly 
with the number of subcircuits. However, the speedup is
reduced by the number of times the average subcircuit is
evaluated due to WR iterations. Hence, the strategy is to keep
the average number of WR iterations as small as possible. One
of the factors that greatly helps is that a moderate waveform
solution accuracy is sufficient for the circuit simulation
problem. The typical number of WR iterations is between 3 and 4
for a well-partitioned circuit, while the number of WR
iterations may vary between 2 and 20 for a typical
heterogeneous circuit. Smaller errors, like those required for
most homogeneous boundary-value problems, would demand 
a much larger number of WR iterations. For this type of prob- 
lem, the convergence rate, considered later, is a much more 
important factor than for the circuit WR problem we con- 
sider here. 

Figure 2. "Width" of a DRAM circuit as a function of the logic level,
starting from the input.



STRUCTURE OF VLSI CIRCUITS 

Special-purpose solvers gain much of their efficiency from uti- 
lizing the specific structure of the problem at hand. VLSI cir- 
cuit solvers are no exception. In fact, we hope that it will be 
clear from this section that a general-purpose WR solver with- 
out special partitioning algorithms would perform very poorly 
for VLSI circuits. We want to identify key properties of large 
VLSI circuits that make them good candidates for WR. With
today's parallel computers, circuits with fewer than several
million transistors are excellent candidates for WR.

As will be explained below, the partitioning step subdi- 
vides these very large circuits into small subcircuits, con- 
taining one to several hundred nodes. Figure 1 shows an ex- 
ample structure of a very large VLSI circuit. Each of the 
blocks may represent a functional unit of a VLSI chip with 
hundreds to thousands of transistors. It is immediately evi- 
dent that these circuits should be partitioned into smaller 
units that may include one or more functional blocks
depending on the block size. The connections between blocks
shown in Fig. 1 may involve multiple paths. However, the 
external connections are usually sparse compared with the 
connections within the functional units. It is very important 
to recognize that the number of fanout connections of a circuit
output is in general very small (e.g., 1 to 6). However, we have
also encountered circuits with a fanout of 4000. 

Each block has the property that the number of logic levels 
or the logical circuits that are connected in series must be 
limited to meet delay time limits or the system clock cycle. 
Hence, most functional units in Fig. 1 are relatively shallow 
in the number of levels. The functional units become "wider" 
as the number of transistors increases. An example is the er- 
ror detection or correction circuitry of a 16 Mbyte dynamic 
random access memory (DRAM) design, shown in Fig. 2. The 
unit contains over $16 \times 10^3$ transistors; however, the number
of logic levels is only 11. As can be seen from the figure, the 
"width" of the unit averages over 200 gates with a large po- 
tential for parallel processing. More insight into this will be 
given in the section entitled "Parallel Waveform-Relaxation- 
Based Circuit Simulation." 

It is evident that the multirate factor increases rapidly as 
circuit size exceeds the size of a functional unit since the 
waveforms may have little correlation especially if they come 
from different functional units. 

INTERNAL WR ALGORITHMS 

In this section we examine the WR iteration process, assum- 
ing that a circuit has already been divided into subcircuits by 
the external partitioning algorithms considered in the section 
entitled "External Global WR Algorithms." The situation that
we explore focuses on the local iteration between two neigh- 
boring subcircuits that are part of a large global circuit envi- 
ronment. 

Figure 1. Basic structure of a large VLSI circuit as a set of blocks
which are interconnected sparsely.



Fundamental WR Techniques 

The waveform iteration process consists of an approximation 
to the solution of a set of nonlinear differential equations by 
a sequence of convergent waveforms. In the equations that 
follow, (w) is used to indicate the WR iteration index. It is 
assumed that subsystems or subcircuits are generated by the 
previously mentioned external partitioning and scheduling 
techniques. The internal algorithm is designed to solve
subcircuit equations that are formed using the MNA approach as

$$C(x)\,\dot{x}(t) = g(x, t) \tag{1}$$

where $x = [v, i]^T$, $v$ are node voltages, and $i$ are selected
currents. The nonlinearities in $C(x)$ are in part due to the
transistor and integrated-circuit capacitances. To ensure that the
solution is unique and that convergence for WR can be 
achieved, the capacitor and transistor models are designed 
with care so that they do not have discontinuities. The re- 
quired properties of C(x) and g(x, t) are considered later in 
more detail in the section entitled "Convergence for the Non- 
linear Case." We also do not want to consider the general dif- 
ferential algebraic equations (DAE) that result from general 
MNA equations since the resultant equations are more com- 
plex than the ordinary differential equations (ODE) case con- 
sidered here, although it has been shown by several research- 
ers that a solution is possible for the DAE case, for example, 
in Ref. 17. 

Consider the scalar equation

$$\dot{x}(t) = f(x, t) \tag{2}$$

where $f(x, t) = C^{-1} g(x, t)$ with the initial condition
$x(0) = x_0$, and where $C > 0$ is a constant capacitor. We gain
some insight into the waveform iterative solution by considering
the Picard-Lindelöf (PL) iteration technique. In this method, the
following waveform iteration is suggested:

$$\dot{x}^{(w+1)}(t) = f(x^{(w)}(t), t) \tag{3}$$

with $x^{(w+1)}(t_0) = x_0$, where $x_0$ is the initial value, which
is the same for all iterations. It is assumed that we want to find
the solution in a window in time $t \in [t_a, t_b]$, where $t_a$ is
the window start time and $t_b$ is the window end time. For
convenience we take the window to be $t \in [0, T]$, where $T$ is
the window size. In the PL technique, the solution of the problem
is obtained by simply integrating the equation as

$$x^{(w+1)}(t) = x_0 + \int_0^t f(x^{(w)}(\tau), \tau)\, d\tau \tag{4}$$

where the initial waveform may be constant in the time window
with $x^{(0)} = x_0$ and subsequent iterations yield new waveforms
$x^{(1)}(t), x^{(2)}(t), x^{(3)}(t), \ldots$.

As an example, if the subsystem of equations is simplified
by assuming that $g(x, t) = -G x(t)$, where $G$ represents a
linear resistor $R$, or $G = 1/R$, then Eq. (1) is reduced to

$$\dot{x}(t) + a\,x(t) = 0 \tag{5}$$

where $a = 1/(RC)$ is the magnitude of the eigenvalue or inverse
time constant. If we apply the PL iteration algorithm to
this RC circuit problem we can make the following statement
about the convergence of the iterative solution:

Theorem 1. If we apply the Picard-Lindelöf method to Eq.
(5) on the interval $t \in [0, T]$, then the global error bound is
given by

$$|x^*(t) - x^{(w)}(t)| \le \frac{(aT)^{w+1}}{(w+1)!} \tag{6}$$

where $x^*(t)$ is the converged or the exact solution.

Proof: Applying the PL iteration Eq. (4) to Eq. (5) with
$x_0 = 1$ we can find the solution to be

$$x^{(w)}(t) = 1 - at + \frac{(at)^2}{2!} - \cdots + (-1)^w \frac{(at)^w}{w!} \tag{7}$$

This is the Taylor-series expansion for the solution
$x^*(t) = e^{-at}$, where $a = 1/(RC)$, and Eq. (6) is found from the
error term in Taylor's theorem.

From this, we gain insight into the behavior of the solution 
of PL iterations for VLSI circuits that involve RC subcircuits 
with capacitances to ground. First, we observe what is called 
the early-time convergence property of the solution, which 
shows that it converges faster for small times. Also, the accu- 
racy of the solution increases by one order for each iteration. 
By inspecting Eq. (6) we find that the time window [0, T] 
must be kept small in relation to the number of waveform 
iterations such that w > aT, to ensure uniform convergence. 
Hence, it may be desirable to subdivide the total analysis 
time into smaller subintervals or time windows for which con- 
vergence is obtained in fewer iterations. It is clear that the 
waveforms must be converged even at the end of the window 
at $t = T$ so that the previous solution will provide a good
starting point for the next window. We will visit this question 
in more detail in the section entitled "Convergence for RC 
Circuits." 
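
The early-time convergence property and the role of the window size can be checked numerically. The following is a minimal sketch, assuming the normalized RC problem of Eq. (5) with $x(0) = 1$ and a constant initial waveform; the function names `picard_iterates` and `error_bound` are illustrative only and do not come from any WR program.

```python
import math


def picard_iterates(a, t, w_max):
    """Return [x^(0)(t), ..., x^(w_max)(t)] for x' = -a*x with x(0) = 1.

    Each Picard-Lindelof sweep adds one Taylor term, so x^(w)(t) is the
    degree-w Taylor polynomial of exp(-a*t) evaluated at t.
    """
    iterates = []
    term = 1.0      # current Taylor term, (-a*t)^k / k!
    partial = 0.0   # running partial sum
    for k in range(w_max + 1):
        partial += term
        iterates.append(partial)
        term *= -a * t / (k + 1)
    return iterates


def error_bound(a, T, w):
    """Bound (a*T)^(w+1) / (w+1)! on the error of iterate w over [0, T]."""
    return (a * T) ** (w + 1) / math.factorial(w + 1)


if __name__ == "__main__":
    a, T = 1.0, 2.0
    exact = math.exp(-a * T)
    for w, xw in enumerate(picard_iterates(a, T, 10)):
        print(w, abs(exact - xw), error_bound(a, T, w))
```

The printed errors are taken at the window end $t = T$, where convergence is slowest. The bound only starts to shrink once $w$ exceeds $aT$, which is why a shorter window needs fewer iterations.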

One-Way Systems and Gauss-Jacobi and Gauss-Seidel WR

For WR we assume that the system equation (1) has been 
split or partitioned according to the techniques described 
later in the section on partitioning. To study local conver- 
gence, we focus attention on the behavior between two con- 
nected subcircuits, and we temporarily ignore all interactions 
with other subcircuits. This is not representative of the real 
WR iteration scheme or schedule that involves all subcircuits. 
The local convergence situation is depicted in Fig. 3 where all 
other WR variables due to partitioning with respect to other 
subcircuits are assumed to be external (known) sources as 
shown. Hence, we assume that only the system variables $x_1$
and $x_2$ are relevant for the local convergence situation.

Local convergence of WR algorithms has been studied by 
many researchers [e.g., Lelarasmee, Ruehli, and Sangiovanni- 
Vincentelli (7), White and co-workers (18,19), and Debefve, 
Odeh, and Ruehli (14)]. We will look at the convergence issue 
in the next three sections. First, we give the most important 
features of the two main algorithms, the Gauss-Jacobi WR 
and Gauss-Seidel WR. They are best explained using the
model in Fig. 3, where subcircuit 1 is excited by an input and 
coupling exists between the subcircuits in both directions, or 



$$\begin{aligned}
\dot{x}_1(t) &= f_1(x_1, x_2, u(t)) \\
\dot{x}_2(t) &= f_2(x_1, x_2)
\end{aligned} \tag{8}$$



where $x_1(0) = x_{10}$ and $x_2(0) = x_{20}$, and $u(t)$ represents
the inputs.

Figure 3. Two subcircuits (SCkt 1 and SCkt 2) shown to illustrate
the local WR iteration process.

A special case exists if the connection from subcircuit 2 to 
subcircuit 1 is missing, or $\dot{x}_1(t) = f_1(x_1, u(t))$ only. In this case,
we have a so-called one-way connection. If we solve the sys- 
tem by solving subcircuit 1 first, followed by subcircuit 2, the 
exact solution is obtained in one forward iteration (6). An ex- 
ample of such a system consists of two metal oxide semicon- 
ductor (MOS) transistor inverters without gate-drain feed- 
back capacitances. Since most logic circuits are highly 
directional even during switching transients, it is evident 
that it is always advisable to solve the circuit in the direction 
of large coupling. 

In the general case, with coupling in both directions, sev- 
eral iterations are necessary to obtain a solution. The Gauss- 
Jacobi WR iteration algorithm is given by 



$$\begin{aligned}
\dot{x}_1^{(w+1)}(t) &= f_1(x_1^{(w+1)}(t), x_2^{(w)}(t), u(t)) \\
\dot{x}_2^{(w+1)}(t) &= f_2(x_1^{(w)}(t), x_2^{(w+1)}(t))
\end{aligned} \tag{9}$$

where $x_1^{(w+1)}(t_0) = x_{10}$ and $x_2^{(w+1)}(t_0) = x_{20}$.

Figure 4. Resistive circuit to illustrate the iteration process if the
circuit is partitioned at $R_2$.

The iteration sequence or schedule for this case is given by
alternate evaluations of subcircuits 1 and 2, or 1, 2, 1, 2, . . .
until convergence. In the Gauss-Jacobi (GJ) algorithm, all
subcircuits are solved at iteration (w + 1) using inputs from
iteration (w). In contrast, the Gauss-Seidel (GS) WR method is
given by



$$\begin{aligned}
\dot{x}_1^{(w+1)}(t) &= f_1(x_1^{(w+1)}(t), x_2^{(w)}(t), u(t)) \\
\dot{x}_2^{(w+1)}(t) &= f_2(x_1^{(w+1)}(t), x_2^{(w+1)}(t))
\end{aligned} \tag{10}$$

where $x_1^{(w+1)}(t_0) = x_{10}$ and $x_2^{(w+1)}(t_0) = x_{20}$. In this
approach, results that are computed in the solution of
subcircuit 1 at iteration (w + 1) are used in the solution of
subcircuit 2 in the same iteration. This ordering and the
immediate use of newly computed results allow the GS algorithm
to take fewer iteration steps to converge than the GJ algorithm.
For this reason, the GS method is generally preferred even
though it puts a larger burden on the external WR algorithms
such as ordering and scheduling, which have to select the
subcircuit analysis sequence. It is not always possible to
update all the variables as required for GS WR. For this case
we will use what we call a mostly GS algorithm that
instantaneously updates as many variables as possible. We will
revisit this issue later
in the section entitled "External Global WR Algorithms."
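
The difference between the two schedules can be made concrete with a small experiment. The sketch below is an illustration under stated assumptions, not the implementation of any WR program: it uses a hypothetical linear pair of subcircuits, $\dot{x}_1 = -x_1 + x_2$ and $\dot{x}_2 = x_1 - x_2$ with $x_1(0) = 1$ and $x_2(0) = 0$, integrates each subcircuit over the whole window with backward Euler, and counts the WR iterations needed by GJ and by GS. The names `solve_subcircuit` and `waveform_relaxation` are made up for this example.

```python
def solve_subcircuit(x0, drive, h):
    """Backward-Euler solution of x' = -x + drive(t) on a uniform time grid.

    `drive` is the waveform received from the neighboring subcircuit,
    sampled at the same grid points as the returned waveform.
    """
    x = [x0]
    for k in range(1, len(drive)):
        x.append((x[-1] + h * drive[k]) / (1.0 + h))
    return x


def waveform_relaxation(T=2.0, n=200, tol=1e-8, gauss_seidel=True,
                        max_iter=200):
    """Iterate two coupled subcircuits over the window [0, T].

    Returns (x1, x2, iterations). With gauss_seidel=False both
    subcircuits use the waveforms of the previous iteration (GJ);
    with gauss_seidel=True subcircuit 2 uses the new x1 at once (GS).
    """
    h = T / n
    x1 = [1.0] * (n + 1)   # initial guess: constant waveforms
    x2 = [0.0] * (n + 1)
    for w in range(1, max_iter + 1):
        new_x1 = solve_subcircuit(1.0, x2, h)
        source = new_x1 if gauss_seidel else x1
        new_x2 = solve_subcircuit(0.0, source, h)
        change = max(max(abs(p - q) for p, q in zip(new_x1, x1)),
                     max(abs(p - q) for p, q in zip(new_x2, x2)))
        x1, x2 = new_x1, new_x2
        if change < tol:
            return x1, x2, w
    raise RuntimeError("WR did not converge within max_iter iterations")


if __name__ == "__main__":
    _, _, gs_iters = waveform_relaxation(gauss_seidel=True)
    _, _, gj_iters = waveform_relaxation(gauss_seidel=False)
    print("GS iterations:", gs_iters, " GJ iterations:", gj_iters)
```

Both schedules converge to the same waveforms. In this example GS needs fewer iterations because each sweep passes the newly computed $x_1$ waveform to subcircuit 2 immediately.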

Successive Under- and Over-Relaxation WR 

The idea of accelerating the solution by overestimating the
update vector is used for most iteration techniques, including
WR. The basic successive over-relaxation (SOR) scheme for GS WR
takes a similar form as in the conventional scheme. A new set of
waveform variables is introduced, which we call $y(t)$. With
this the GS SOR WR scheme can be written as

$$\begin{aligned}
\dot{y}_1^{(w+1)}(t) &= f_1(y_1^{(w+1)}(t), x_2^{(w)}(t), u(t)) \\
x_1^{(w+1)}(t) &= \beta\, y_1^{(w+1)}(t) + (1 - \beta)\, x_1^{(w)}(t) \\
\dot{y}_2^{(w+1)}(t) &= f_2(x_1^{(w+1)}(t), y_2^{(w+1)}(t)) \\
x_2^{(w+1)}(t) &= \beta\, y_2^{(w+1)}(t) + (1 - \beta)\, x_2^{(w)}(t)
\end{aligned} \tag{11}$$

where $y_1^{(w+1)}(t_0) = x_{10}$ and $y_2^{(w+1)}(t_0) = x_{20}$. The
over- or under-relaxation factor is usually in the range
$0 \le \beta \le 2$.

The first practical application of SOR WR to VLSI circuit
problems was done by Carlin and Vachoux (20). They applied
under-relaxation to a stiff high-gain problem and showed that
convergence could be improved by using $\beta < 1$. A stiff
problem is one with a large difference in eigenvalues
or time constants.

Convergence of WR SOR has been studied theoretically by
Miekkala and Nevanlinna (21). It has also been applied to a
semiconductor device problem by Reichelt, White, and Allen
(22). They used a frequency-dependent over-relaxation factor
$\beta(f)$ that was applied to the time domain through a
convolution operator. The general time window under- and
over-relaxation WR technique applied to VLSI circuit problems
most likely can benefit greatly from a time-dependent factor
$\beta(t)$ for $t \in [0, T]$.

Convergence for a Resistance Circuit

The convergence of the WR has been studied extensively for
linear circuits by several researchers, for example,
Miekkala and Nevanlinna (21) and Desai and Hajj (23). In this
section, we look at the static case of the small resistance
circuit in Fig. 4, which is important for the partitioning step.
The exact solution for this problem is given by

$$v_3 = I R_1 \frac{R_3}{R_1 + R_2 + R_3} \tag{12}$$

which can be found by inspection.

For the iterative solution we define the forward gain
$g_f = R_3/(R_2 + R_3)$ and the backward gain
$g_r = R_1/(R_1 + R_2)$, which are simply the voltage divider
ratios. This corresponds to splitting the circuit at $R_2$. The
voltage dividers lead to the following voltage ratios:
$v_3 = g_f v_1$ and $v_1 = g_r v_3$. The iterative solution yields

$$v_1^{(0)} = I R_{12} \tag{13}$$

with $R_{12} = (R_1 R_2)/(R_1 + R_2)$. With this, the iterative
solution is given by

$$v_1^{(w)} = v_1^{(0)} \left[ 1 + g_f g_r + (g_f g_r)^2 + \cdots + (g_f g_r)^w \right] \tag{14}$$

The contraction factor is given by $\gamma = g_f g_r$. For
convergence within a few iterations this factor needs to be
$\gamma < 1$. Assume as an example that $R_1 = 1$, $R_2 = 10$,
$R_3 = 6$. Then $g_f = 3/8$ and $g_r = 1/11$, which leads to
$\gamma = 3/88$. In this case convergence is reached in very few
iterations to a very high accuracy. Also, directionality of
coupling can be assigned, even with this simple circuit, as we
observe since $g_f > g_r$. From the logic signal flow it is
evident that the directionality is assigned in the high-gain
direction (14).
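
The static iteration can be reproduced in a few lines. The sketch below rests on an assumed circuit topology, since the figure itself is not reproduced here: a current source I drives node 1, $R_1$ connects node 1 to ground, $R_2$ connects node 1 to node 3, and $R_3$ connects node 3 to ground. The circuit is split at $R_2$ and the two halves exchange their latest node voltages. The function names are illustrative only.

```python
def relax_resistive(R1, R2, R3, I, sweeps):
    """Relaxation for the three-resistor circuit split at R2.

    Subcircuit 1 (I, R1, R2) sees v3 as a fixed source behind R2;
    subcircuit 2 (R2, R3) sees v1 as a fixed source behind R2.
    Returns the list of (v1, v3) pairs after each sweep.
    """
    g_f = R3 / (R2 + R3)          # forward gain, v3 = g_f * v1
    g_r = R1 / (R1 + R2)          # backward gain of the divider R1, R2
    R12 = R1 * R2 / (R1 + R2)     # R1 in parallel with R2
    v3 = 0.0                      # initial guess for the far node
    history = []
    for _ in range(sweeps):
        v1 = I * R12 + g_r * v3   # solve subcircuit 1 with v3 held fixed
        v3 = g_f * v1             # solve subcircuit 2 with the new v1
        history.append((v1, v3))
    return history


def exact_v3(R1, R2, R3, I):
    """Direct solution for v3 of the unpartitioned circuit."""
    return I * R1 * R3 / (R1 + R2 + R3)


if __name__ == "__main__":
    target = exact_v3(1.0, 10.0, 6.0, 1.0)
    for w, (v1, v3) in enumerate(relax_resistive(1.0, 10.0, 6.0, 1.0, 6)):
        print(w, v3, abs(v3 - target))
```

With $R_1 = 1$, $R_2 = 10$, and $R_3 = 6$ the error in $v_3$ shrinks by the contraction factor $g_f g_r = 3/88$ in every sweep, so a handful of sweeps is enough.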
Convergence for RC Circuits



As was mentioned earlier, the convergence of WR for general
circuits has been studied from the very beginning, and the
impact of the capacitors on the convergence is considered a
key issue. In the early work on WR it was assumed that each
node in the circuit was required to have a capacitor to ground.
More relaxed conditions have been established recently by
Desai and Hajj (23) and Gristede, Ruehli, and Zukowski (24).
Specifically, the convergence of RC-type circuits has been
investigated by several researchers, for example, Miekkala,
Nevanlinna, and Ruehli (25), Leimkuhler, Miekkala, and
Nevanlinna (26), Ruehli and Zukowski (27), and Leimkuhler
and Ruehli (28).

At first, it seems that many different RC circuit topologies
need to be considered to gain an understanding of the WR
behavior of RC circuits. However, in VLSI circuits there are
two circuit topologies that appear many times as basic
building blocks. The first involves a capacitor connected between
two nodes, in which this capacitance may represent a
gate-to-drain capacitance. To study its impact, the worst-case RCR
situation is considered in Fig. 5(a), where only resistances are
connected to the ground nodes. It should be noted that the
usual sufficient conditions for WR convergence, for example,
Ref. 4 or 16, do not include this case.

It was shown in Ref. 25 that, even for the RCR circuit in
Fig. 5(a), convergence can be achieved under certain
conditions. The WR iteration equations for the case where we
assume that a current source is connected to the left node in
the circuit in Fig. 5(a) are given by

$$\dot{v}_1^{(w+1)}(t) + \frac{1}{R_1 C_2}\, v_1^{(w+1)}(t) = \dot{v}_3^{(w)}(t) + \frac{1}{C_2}\, I(t) \tag{15}$$

$$\dot{v}_3^{(w+1)}(t) + \frac{1}{R_3 C_2}\, v_3^{(w+1)}(t) = \dot{v}_1^{(w)}(t) \tag{16}$$

These local mapping functions show that the derivatives at
one of the partitioned nodes $v_1$ or $v_3$ are a function of the
derivative at the other end of the partition $v_3$ or $v_1$,
respectively. It is intuitively obvious that for this case not
only the input forcing functions but also the derivatives must
be continuous for the WR iteration to converge. In actual VLSI
circuits this issue is somewhat moderated since the
gate-to-drain capacitances for the MOS field-effect transistors
(MOSFET) have at least some capacitances to ground at each end.
Again, the partitioning of the capacitance between gate and
drain is very desirable in spite of the difficulties since a MOS
transistor is a perfect one-way device if the capacitive
coupling is ignored. This analysis emphasizes the requirements
for the smoothness of the MOS capacitance models. It is
confirmed by the mathematical analysis in Ref. 26 that slow WR
convergence can be achieved for the limiting case in Fig. 5(a)
in terms of Sobolev norms, which measure the derivatives as
well as the functional values.

The second basic RC circuit is the low-pass CRC circuit,
Fig. 5(b), which was analyzed by Ruehli and Zukowski (27)
for simple cases, and for more complex RC circuits by
Leimkuhler and Ruehli (28). Unlike the above RCR circuit, the
CRC circuit seems to lend itself well to partitioning. However,
there seems to be a problem in that it is hard to give a static
argument for partitioning at the resistor $R_2$ as was done in
the previous section for resistive circuits. The voltage transfer
function for the case in which we excite the circuit with a
voltage source $V_a$ in series with $C_1$ is given by

$$\begin{aligned}
\dot{v}_1^{(w+1)}(t) + \frac{1}{R_2 C_1}\, v_1^{(w+1)}(t) &= \frac{1}{R_2 C_1}\, v_3^{(w)}(t) \\
\dot{v}_3^{(w+1)}(t) + \frac{1}{R_2 C_3}\, v_3^{(w+1)}(t) &= \frac{1}{R_2 C_3}\, v_1^{(w)}(t)
\end{aligned} \tag{17}$$

This can be written in the Laplace domain as

$$(sI + M)\, v^{(w+1)}(s) = N\, v^{(w)}(s) \tag{18}$$

where M and N are evident from Eqs. (17) and (18). We can
rewrite Eq. (18) as

$$v^{(w+1)}(s) = K(s)\, v^{(w)}(s) \tag{19}$$



where the meaning of the symbol $K(s)$ is evident from
comparing the last two equations. The following theorem from
Miekkala and Nevanlinna (10) is applied to the problem to find
the spectral radius.

Theorem 2. Assume that the eigenvalues of M have positive
real parts. Then the spectral radius of $K$ is
$\rho(K) = \max_{\omega} \rho\left((i\omega I + M)^{-1} N\right)$.
For this case we have

$$K(s) = \begin{bmatrix} 0 & \dfrac{1}{s R_2 C_1 + 1} \\ \dfrac{1}{s R_2 C_3 + 1} & 0 \end{bmatrix} \tag{20}$$

and it is clear that the maximum occurs for $s = 0$ where
$\rho(K(0)) = 1$, which indicates that the convergence problem
could occur at $s \to 0$.

In Refs. 27 and 28 it is shown that this problem does not
occur for a finite time window. For convenience, we set both
time constants to unity by choosing $C_1 = C_3 = R_2 = 1$. Then
we can find the iterative solution to be

$$v^{(w)}(t) = e^{-t} \sum_{m=0}^{w} \frac{t^{2m}}{(2m)!} \tag{21}$$

With $(2m)! \sim \sqrt{4\pi m}\,(2m)^{2m} e^{-2m}$ the error term is

$$\varepsilon^{(w)}(t) \approx e^{-t} \sum_{m=w+1}^{\infty} \frac{1}{\sqrt{4\pi m}} \left( \frac{e\,t}{2m} \right)^{2m} \tag{22}$$

From this we can derive the rapid convergence of the
partitioned circuit provided that the window is small enough.

Figure 5. Two fundamental circuit topologies for partitioning as part
of VLSI circuits: (a) RCR circuit; (b) CRC circuit.



(a) 



(b) 
Theorem 3. The WR sequence converges rapidly in a time
window T after the wth iteration for $w \ge eT/2$.

We can see from this that for our normalization $R_2 C_1 = 1$ and
$R_2 C_3 = 1$ the convergence is very fast for $w \ge eT/2$. Hence,
the larger the time constants of the two partitioned circuits, 
the larger the time window T for which rapid convergence 
occurs for a particular number of WR iterations w. It should 
be noted that this type of partitioning can again be done stati- 
cally in the partitioning process by choosing an appropriate 
value of the time window T and the approximate number of 
WR iterations. 
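
The trade-off between window size and iteration count can be tabulated before a run. The sketch below assumes the unit time constants used above and assumes that the iterates have the series form $e^{-t} \sum_{m=0}^{w} t^{2m}/(2m)!$, so that the error at the window end is the tail of that series. The names `min_iterations` and `tail_error` are illustrative only.

```python
import math


def min_iterations(T):
    """Smallest integer w satisfying w >= e*T/2 for a window of size T."""
    return math.ceil(math.e * T / 2.0)


def tail_error(w, T, extra_terms=60):
    """Evaluate exp(-T) * sum over m > w of T^(2m) / (2m)!.

    Terms are built recursively to avoid large factorials; the tail is
    truncated after `extra_terms` terms, which are negligible beyond that.
    """
    term = 1.0    # m = 0 term, T^0 / 0!
    total = 0.0
    for m in range(1, w + 1 + extra_terms):
        term *= T * T / ((2 * m - 1) * (2 * m))
        if m > w:
            total += term
    return math.exp(-T) * total


if __name__ == "__main__":
    T = 4.0
    print("suggested iterations:", min_iterations(T))
    for w in range(1, 9):
        print(w, tail_error(w, T))
```

For $T = 4$ the error at the window end stays large for the first few iterations and then drops quickly once $w$ passes $eT/2$, which is the behavior described by Theorem 3.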

Another important observation can be made from this 
analysis on convergence behavior for windowing. First, most 
realistically modeled nodes for VLSI circuits, with the excep- 
tion of the gate-to-drain capacitances, can be represented by 
the basic circuit in Fig. 5(b). Hence, the convergence behavior
shown in this section given by Eq. (22) is quite typical. It
shows that if the window T is chosen too large or,
equivalently, the number of WR iterations w is chosen to be
insufficient, then the solution may be quite poor since the rapid
convergence regime has not been reached. Specifically, at the 
window boundary t = T, the approximation and, even more 
important, the derivatives of the solution are approximated 
very poorly. Then, multistep integration techniques, such as 
the popular BDF2 method (29), that utilize solution points 
from the previous window are used to continue the solution 
in the next time window. This obviously represents a very 
poor starting condition for the solution in the next time 
window. 

Convergence for the Nonlinear Case 

Key aspects of VLSI circuits are the nonlinear MOSFETs and
capacitances associated with the devices as well as the
on-silicon diffusion wires and diodes. This requires a nonlinear
analysis of the convergence, which has been available since
the start of WR, for example, Lelarasmee, Ruehli, and
Sangiovanni-Vincentelli (7), White et al. (18), White and
Sangiovanni-Vincentelli (19), and Debefve, Odeh, and Ruehli (14).
However, the nonlinear WR convergence proofs have become
more general in recent years. The proofs by Schneider (17)
and Gristede, Ruehli, and Zukowski (24) take the DAE
(differential-algebraic equation) for the MNA circuit formulation
(17) into account. Also, more useful bounds have been derived
with a one-sided Lipschitz constant by in 't Hout (30) and in
Burrage's book (15). Here, we give an interesting and relevant
proof of Taubert and Wiedl (31) that illuminates the nonlinear
convergence in terms of a time window T. The vector u is
given by $u = (u_1, u_2, \ldots, u_m)$, and the two relevant norms
are $\|u\|_\infty = \max_{1 \le i \le m} |u_i|$ and
$\|x\|_T = \max_{t \in [0,T]} \|x(t)\|_\infty$.

The circuit equations for a MOSFET circuit including the
voltage-dependent capacitors are given as [similar to Eq. (1)]

$$C(x)\,\dot{x}(t) = G(x, t) \tag{23}$$

with the initial value $x(0) = x_0$. All the conditions below are
assumed to apply in a window in time, which we choose to be
$t \in [0, T]$. The voltage excursion must be contained for the
semiconductor devices such that the nonlinearities can be
described by a valid circuit model. Hence, we assume that limits
are also applied on the particular values of x so that the
conditions of the theorem are met.



First, we assume that the transistor nonlinearities satisfy 
the Lipschitz continuity condition 

$$\|G(x_1, t) - G(x_2, t)\|_\infty \le K\, \|x_1 - x_2\|_\infty \tag{24}$$

for the allowed values of $t$, $x_1$, $x_2$. Second, the nonlinear
behavior of the $m \times m$ capacitance matrix with respect to the
real vector z of length m must satisfy several conditions. Each
element of the capacitance matrix $C(z)$ must satisfy another
Lipschitz continuity condition with a constant L that applies
for all j for the real vectors u, v:

$$\sum_{k=1}^{m} |c_{jk}(u) - c_{jk}(v)| \le L\, \|u - v\|_\infty \tag{25}$$

A further condition is imposed on the capacitances. We
assume that there exists a constant $\alpha > 0$ such that for all
real vectors u, z where $u \ne 0$ we have

$$[C(z)u]_i\, u_i \ge \alpha\, \|u\|_\infty^2 \tag{26}$$

This condition can be viewed as being related to the
instantaneous energy in the system of capacitances C, which is
given by $\tfrac{1}{2} u^T C u > 0$ for $u \ne 0$. For a nodal capacitance matrix,
this implies diagonal dominance. For the nonlinear case, the 
requirements in Eq. (26) are somewhat more restrictive than 
what is required for the multiple capacitances for which 
z ~ u. 

The WR equations for two subcircuits in which each sub- 
circuit is represented by a single equation are given by 

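$$\begin{aligned}
c_{11}(x_1^{(w+1)}, x_2^{(w)})\, \dot{x}_1^{(w+1)}(t) + c_{12}(x_1^{(w+1)}, x_2^{(w)})\, \dot{x}_2^{(w)}(t) &= G_1(x_1^{(w+1)}(t), x_2^{(w)}(t), t) \\
c_{21}(x_1^{(w+1)}, x_2^{(w+1)})\, \dot{x}_1^{(w+1)}(t) + c_{22}(x_1^{(w+1)}, x_2^{(w+1)})\, \dot{x}_2^{(w+1)}(t) &= G_2(x_1^{(w+1)}(t), x_2^{(w+1)}(t), t)
\end{aligned} \tag{27}$$

where the contribution of the nonlinear capacitances is
evident. The initial conditions are $x_1^{(w+1)}(t_0) = x_{10}$ and
$x_2^{(w+1)}(t_0) = x_{20}$. Note that in terms of capacitances
$c_{11} = C_1 + C_2$, $c_{12} = c_{21} = -C_2$, and $c_{22} = C_2 + C_3$,
where the circuit consists of three capacitances with the same
topology as the circuits in Fig. 5. All three capacitances
$C_1$, $C_2$, and $C_3$ can be nonlinear.

Now, we are ready to state the very interesting condition
for nonlinear convergence in a time window $[0, T]$.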

Theorem 4. The sequence of approximate solutions given by
the WR iterations $x^{(w+1)}$ converges uniformly to the solution
$x^*$ of Eq. (23) in $[0, T]$ for which the following condition holds:

$$\alpha - K T - L T\, \|\dot{x}^*\|_T > 0$$



Proof. Here, we only give an outline of the proof. The unique- 
ness of the solution of Eq. (23) is guaranteed by the conditions 
given earlier in a time window t £ [0,71. For any continuous 
diflferentiable function x(J) with the initial condition x(0) = 
x*(0), we form th difference 

Afx. x*) = C(x)x - C{x*)x* - G(x. t) + G(x\ Q (28) 



which can be expanded by adding and subtracting the quantity
$C(x)\dot{x}^*$. For $t \in [0, T]$ we form the quantity

(29)

Using also the fact that

$$x_i^{(w)}(t) = x_{i0} + \int_0^t \dot{x}_i^{(w)}(s)\,ds \qquad (30)$$



and using the Lipschitz conditions in the expanded form of 
Eq. (29) with K and L from above, we can show the inequality in
Theorem 4. We assume that the size of the time windows T 
is adjusted during the transient analysis. 



It is evident from this that both the nonlinearities of the 
capacitances and the devices can reduce the maximum size of 
the time window during the highly nonlinear transitions of 
the devices for which usually the smallest time windows T 
occur. This transition time is usually a small part of the tran- 
sient analysis time. Also, the transition is the time where the 
circuit solver will take very small time steps, so that T in- 
cludes a reasonable number of time steps. 

Newton Variant of WR 

Given Eq. (2) in a general form, the WR schemes considered
so far first partition the system at the differential equation
level. Then the nonlinear equations are solved separately for
each subcircuit using Newton's method. Van Bokhoven (32)
considered a variation on WR by essentially interchanging
the waveform loop with the Newton linearization of the
equations. Hence, the Newton variant of WR starts by linearizing
Eq. (2) for the entire circuit. The system of equations
rewritten in a functional form is

$$F(x) = C(x, t)\,\dot{x}(t) - g(x, t) = 0 \qquad (31)$$

This form can be linearized using the Newton scheme as

$$x^{(n+1)} = x^{(n)} - J_F^{-1}(x^{(n)})\,F(x^{(n)}) \qquad (32)$$

where $J_F(x)$ is the Frechet derivative of F and where n is the
nonlinear or Newton iteration index. This method has been
explored by many researchers [e.g., (18,19,32-34)]. It can be
shown that the resultant scheme

$$\dot{x}^{(n+1)} - J^{(n)} x^{(n+1)} = f(x^{(n)}) - J^{(n)} x^{(n)} \qquad (33)$$

is another splitting of the circuit matrix of the entire circuit
(15). If we apply only a single Newton iteration, n = 1, we
can partition the resultant circuit matrix and we can use an
external waveform relaxation loop. At this level, all the
necessary algorithms such as windowing are applied.

The Newton waveform technique has been successfully applied
to the homogeneous semiconductor problems by Lumsdaine
and White (11). These problems do not require a complex
partitioning procedure as is the case for heterogeneous
systems. For homogeneous problems the Newton waveform
approach is preferred for its quadratic convergence behavior.
However, this is more of an issue for the solution of
homogeneous systems since much more accuracy and therefore a
much larger number of WR iterations are required than for
the circuit simulation problem.

Other issues are of importance for a large VLSI circuit
problem for which the general WR algorithm offers several
advantages over the Newton waveform approach. First, for
the conventional WR, circuits are partitioned at the schematic
level into self-contained subcircuits that are analyzed
independently using a conventional circuit solver. Furthermore,
the interaction between subcircuits and functional units, at
all levels, simply consists of the exchange of segments of
waveforms of various sizes. Other techniques such as the
hierarchical WR techniques (35) and parallel WR discussed
later benefit greatly from the simplicity of the conventional
WR approach.

EXTERNAL GLOBAL WR ALGORITHMS

In this section we consider the overall environment that is
required for a heterogeneous VLSI circuit WR solver where
the circuit structure may be extremely nonuniform. This is
especially true for mixed analog-digital circuits. To make the
issue more complex, feedback loops in logical circuits may
require more WR iterations than the local WR interfaces
require to converge. It is evident from this that all aspects of a
WR program must be implemented carefully to obtain maximum
overall efficiency. Furthermore, it is very unlikely that
the optimal number of WR iterations is uniformly the same
for all local interfaces between the subcircuits. Before we can
consider these global convergence issues at the end of this
section, we first must introduce other fundamental concepts
such as partitioning, ordering, and scheduling. A detailed
description of the concepts is given in Ref. 14. A key aspect of
the external environment is the storage of the waveforms. As
will be evident below, the waveforms for iterations w and
w + 1 must be available for computations.

Partitioning

The partitioning of a circuit into small subcircuits is clearly a
heuristic process for heterogeneous systems. One of the key
driving factors for partitioning is that convergence of the
internal WR algorithms must be enhanced by the partitioning
process. This has been recognized since the beginning of VLSI
WR, and much work has been dedicated to this issue throughout
the evolution of WR [e.g., Carlin and Vachoux (36), White
and Sangiovanni-Vincentelli (37), Cockerill et al. (38), and
John, Rissiek, and Paap (39)].



Definition 1. Partitioning means subdividing a large circuit
(Ckt) into small subcircuits (SCkts). The SCkts are chosen in
such a way that coupling between subcircuits is minimized
and that convergence is enhanced.

It should be noted that this type of partitioning is also known
as multisplitting or diakoptics. Most partitioning algorithms
are static; the partitions are defined before a transient analysis
is performed. In fact, it is the first step in the overall WR
scheme. Some exploratory work on dynamic partitioning has
been done by Dumlugol, Cockx, and DeMan (40) for specific
circuit structures in which the partitions are altered during
the iteration process. It is evident that for large heterogeneous
circuits static partitioning is preferred since it can be
designed for all types of structures.

Figure 6. (a) A MOS transistor circuit partitioned into three
subcircuits; (b) A directed graph corresponding to Fig. 6(a), which
shows the main logic signal flow.

Two of the most popular methods are pointwise and block
partitioning. Pointwise partitioning breaks the circuit at each
node, generating subcircuits with one node each. This scheme
does not control the coupling between subcircuits. On the
other hand, block partitioning groups one or more nodes into
a SCkt based upon estimates of the coupling of the circuit
elements that connect between them. The techniques in the
previous sections entitled "Convergence for a Resistance Circuit"
and "Convergence for RC Circuits" are applied to see if two
nodes should be in the same SCkt by evaluating the potential
coupling. The nodes of the resultant SCkt are then ensured
not to be strongly coupled to other SCkts at least in one
direction. This direction is away from the SCkt for an output and
into the SCkt for an input. Hence, block or subcircuit
partitioning leads to much faster WR convergence.

Most WR programs also use graph theoretic partitioning 
algorithms like the strongly connected or dc connected compo- 
nents (14). An example of a circuit that has been partitioned 
into dc connected components is shown in Fig. 6(a). In Fig. 
6(b), a directed graph is shown that corresponds to the circuit 
in Fig. 6(a). 

Next, we consider in more detail the decision process for 
the assembly of nodes into SCkts. One of the algorithms used 
is the diagonally dominant Norton (DDN) algorithm by White 
and Sangiovanni-Vincentelli (37), which is based on tech- 
niques given in the section on resistance-circuit convergence 
for static partitioning. This algorithm is based on the idea 
that two nodes may either be coupled only resistively or ca- 
pacitively. The simple circuit in Fig. 4 provides a model for 
applying this algorithm. Consider the resistor R t to represent 
all parallel conducting paths between any two nodes. These 
include all resistive and inductive elements as well as "worst- 
case" values for the nonlinear conductances of semiconductor 
devices. The inductance voltage drops are set to zero for the 
conductance between nodes. The resistances R x and R% repre- 
sent the equivalent resistance of all local paths to ground. 
Again, this is done by ignoring all capacitance in the circuit 
for these two nodes. If the convergence factor y = g(g r is 
greater than some threshold value, usually chosen to be between
0.3 and 0.95, the two nodes are considered to be
strongly coupled and are placed into the same subcircuit.



Since the model is set up using the worst case for nonlinear 
resistances, a slightly higher value of the threshold is used 
since the gain estimates are conservative. It should be noted 
that the same technique can be used for a circuit that in- 
cludes only capacitors in exactly the same way as in the sec- 
tion on convergence for RC circuits in which the "equivalent" 
resistance values used are given by R = 1/C. All pairs of
nodes that are directly connected to one another are considered
in the partitioning process, and the algorithms just
described will decide whether to place them in the same
subcircuit or not. Ideally, the resultant SCkts are small so that
each has only a few nodes. However, if too many single-node
subcircuits result, it may be advantageous to merge some
subcircuits into larger ones. Merging or condensing will
reduce the number of WR iterations at the expense of having to
solve larger subcircuits. An example of a situation where it
may be advisable to condense subcircuits occurs when global
feedback loops exist in the circuit. This will be explained in
more detail in the next section.
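The pairwise decision described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the DDN algorithm itself: the coupling factor of every directly connected node pair is assumed to have been estimated beforehand, and the names `partition_nodes`, `couplings`, and `threshold` are hypothetical. Pairs whose factor exceeds the threshold are merged with a union-find structure.

```python
def partition_nodes(nodes, couplings, threshold=0.5):
    """Group circuit nodes into subcircuits (SCkts).

    nodes:      iterable of hashable node names
    couplings:  dict (node_a, node_b) -> estimated coupling factor
                (assumed precomputed; the estimate itself is not shown)
    threshold:  pairs with a factor above this value share one SCkt
    """
    nodes = list(nodes)
    parent = {n: n for n in nodes}

    def find(n):
        # Walk to the group representative, halving the path on the way.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for (a, b), factor in couplings.items():
        if factor > threshold:
            parent[find(a)] = find(b)  # strongly coupled: merge groups

    groups = {}
    for n in nodes:
        groups.setdefault(find(n), []).append(n)
    return sorted(sorted(g) for g in groups.values())


if __name__ == "__main__":
    print(partition_nodes(
        ["n1", "n2", "n3", "n4"],
        {("n1", "n2"): 0.9, ("n2", "n3"): 0.1, ("n3", "n4"): 0.7},
    ))
```

Raising the threshold yields more single-node subcircuits, while lowering it condenses the circuit into fewer and larger ones, which is the trade-off discussed in the text.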

Ordering and Scheduling 

Definition 2. Ordering is defined as the process of labeling
the subcircuits in an increasing order starting with the one(s)
that should be solved first.

It is evident from the example of a one-way inverter chain, 
Fig. 7, in which the gate-drain capacitances are ignored, that 
the solution starting from input to output leads to conver- 
gence in one WR iteration. On the other hand, if the subcircu- 
its are ordered from output to input, m inverters require m 
WR iterations for convergence. It was shown in the first sec- 
tion that large VLSI circuits that are simulated by WR tech- 
niques have many parallel paths, which can result in the 
same ordering in each of the paths or chains of SCkts. 
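The effect of ordering on a one-way chain can be reproduced with a small static toy model. It is a sketch only: each stage is an idealized inverter computing `1 - input` instead of a transient waveform, and the function name `sweeps_to_converge` and the initial guess of 0.5 are assumptions made for illustration.

```python
def sweeps_to_converge(order, m, max_sweeps=100):
    """Count Gauss-Seidel sweeps until a one-way chain of m ideal
    inverters reaches its exact static solution.

    Stage i (1..m) computes y[i] = 1 - y[i-1]; y[0] is the fixed input.
    `order` lists the stage indices in the sequence they are updated.
    """
    order = list(order)
    exact = [0.0]
    for i in range(1, m + 1):
        exact.append(1.0 - exact[i - 1])

    y = [0.0] + [0.5] * m  # 0.5 is a deliberately wrong initial guess
    for sweep in range(1, max_sweeps + 1):
        for i in order:
            y[i] = 1.0 - y[i - 1]  # uses the newest available input value
        if y == exact:
            return sweep
    return None


if __name__ == "__main__":
    m = 5
    print(sweeps_to_converge(range(1, m + 1), m))   # input-to-output order
    print(sweeps_to_converge(range(m, 0, -1), m))   # output-to-input order
```

With input-to-output ordering every stage sees a correct input during the first sweep. With the reverse ordering the correct value advances by only one stage per sweep, so m stages need m sweeps.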

Ordering becomes more difficult for circuits with feedback
loops. There are two possible choices for dealing with
feedback. An example is given in Fig. 6(a) for a circuit with
feedback. This is apparent from the graph in Fig. 6(b). For small
feedback loops involving only a few SCkts like that given in
this example, it may be more efficient to form a larger SCkt
by merging all the SCkts in a feedback loop into a single
SCkt. For larger loops, it may be better to cut the feedback
loops (14). Application of these techniques results in a new
set of SCkts without cycles. The feedback-loop-cutting
algorithm can also be viewed as a mostly Gauss-Seidel approach
in which the values at the feedback input nodes are specified
at iteration (w + 1) by using feedback values from $x^{(w)}$, as is
done for all variables in the Gauss-Jacobi technique. All the
other nonfeedback variables are updated in the GS fashion.
So-called strongly connected component techniques (41) are
used to detect the inputs to the feedback loops, which are the
GJ variables. An ordering for the resultant circuit in terms
of the SCkts is found by the leveling algorithm.

Figure 7. Chain of MOSFET inverters which is used to illustrate
different scheduling techniques.

Leveling Algorithm.

Input: SCkt graph

Output: Assignment of SCkts to ordering levels

Function LevelizeSubCircuits
{
    LevelNumber = 0;
    Assign inputs to LevelNumber = 0;
    Set NumberOfOrderedInputs(k) = 0 for every SCkt k;
    REPEAT
        FOR each SCkt s in LevelNumber {
            FOR each SCkt k in fanout set for s {
                NumberOfOrderedInputs(k) =
                    NumberOfOrderedInputs(k) + 1;
                IF NumberOfOrderedInputs(k) ==
                    NumberOfInputs(k)
                    assign SCkt k to LevelNumber + 1;
            }
        }
        LevelNumber = LevelNumber + 1;
    UNTIL Level with LevelNumber is empty
}
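The leveling pseudocode translates almost line by line into a runnable form. The sketch below assumes the SCkt graph is acyclic (feedback loops already merged or cut) and is supplied as plain dictionaries; the names `levelize`, `fanout`, `num_inputs`, and `sources` are hypothetical.

```python
def levelize(fanout, num_inputs, sources):
    """Assign each SCkt to an ordering level.

    fanout:     dict SCkt -> list of SCkts it drives (acyclic graph assumed)
    num_inputs: dict SCkt -> number of driving SCkts that must be ordered first
    sources:    SCkts driven only by primary inputs; they form level 0
    """
    level = {s: 0 for s in sources}
    ordered_inputs = {s: 0 for s in num_inputs}
    current = list(sources)
    level_number = 0
    while current:  # UNTIL the level just built is empty
        next_level = []
        for s in current:
            for k in fanout.get(s, []):
                ordered_inputs[k] += 1
                if ordered_inputs[k] == num_inputs[k]:
                    # All drivers of k are ordered: k goes one level up.
                    level[k] = level_number + 1
                    next_level.append(k)
        current = next_level
        level_number += 1
    return level


if __name__ == "__main__":
    fanout = {"A": ["C"], "B": ["C", "D"], "C": ["D"]}
    num_inputs = {"A": 0, "B": 0, "C": 2, "D": 2}
    print(levelize(fanout, num_inputs, ["A", "B"]))
```

A SCkt is placed only when its last driver has been ordered, so it always lands one level above its highest-level driver.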

For scalar WR it is sufficient to order the SCkts in such a way
that the transient analysis of each SCkt in a lower level is
scheduled before the next higher-level SCkt, so that all the
input variables at iteration (w + 1) are available before the
transient analysis for a SCkt is performed.



Definition 3. Scheduling means the scheduling for analysis
of a subcircuit according to the ordering until WR convergence
is achieved.

Most existing WR-based circuit solvers use what is called
basic scheduling. However, other scheduling methods may be
more efficient than this approach. Here, we consider a simple
chain of ordered SCkts to illustrate the different techniques.
To make the example applicable to all scheduling techniques
of interest, we use a special circuit for which we can overlap
some of the nodes of any pair of SCkts. The overlap means
that a subset of nodes in a subcircuit may be shared between
two neighboring subcircuits. We take the chain of five inverters
in Fig. 7 as a simplified example and order it from input
to output as is shown in Fig. 7, with the SCkts labeled 1 to 5.
In a basic schedule, the SCkts are analyzed according to the
ordering starting from the input, and a basic analysis schedule
is given by

1 2 3 4 5
1 2 3 4 5
1 2 3 4 5

Here we assume in this example that global convergence has
been achieved in three WR iterations. If there are several time
windows in the analysis, we execute this schedule for each of
the windows separately. It is well known that the number of
WR iterations for convergence is very nonuniform for the
different time windows due to the switching activities of the
transistors.

To properly explain the overlap scheduling technique we
assume that each of the subcircuits has several nodes, unlike
the very simple inverter chain in Fig. 7, which has only one
node per subcircuit. It is intuitively obvious that if we could
overlap or share some of the nodes of the neighboring SCkts
during the local WR iteration, then the convergence would be
enhanced at the cost of having to analyze the shared nodes
twice as many times at each iteration. Here we assume that
each SCkt now has three nodes instead of one. We still assume
that we have a chain of five SCkts, where the labels of
the three internal nodes per SCkt are given as

11 12 13   21 22 23   31 32 33   41 42 43   51 52 53

It is evident that many different overlaps can be chosen in
this example even if only the boundary nodes are shared
between subcircuits. To illustrate the fact that the overlap does
not even have to be symmetrical we give the following overlap
example:

11 12 13 21   21 22 23 31   31 32 33 41   41 42 43 51   51 52 53
11 12 13 21   21 22 23 31   31 32 33 41   41 42 43 51   51 52 53



Specifically, we chose an overlap in which one node of the
next SCkt is taken into account while analyzing a SCkt.
However, we assume that it is a waste of compute time to also
take the corresponding node from the previous subcircuit into
account. An example for symmetric overlap for subcircuit 2
would be 13 21 22 23 31. From this it is evident that the
number of nodes in each SCkt analysis can grow rapidly with
overlap scheduling. Hence, the reduced number of WR iterations
must be balanced against the analysis of larger subcircuits.
We observed experimentally that overlap scheduling
does not work well for circuits in which the coupling is
sufficiently weak, as is the case for an inverter chain. Its main
application is for situations in which the coupling is strong
for a large number of nodes such that large subcircuits result.
Overlap scheduling works well even if the circuits are very
strongly coupled. Hence, it is best applied to severe coupling
situations. Overlap scheduling has been applied in different
forms since the beginning of WR. The first paper applying
overlap scheduling was by Mokari-Bolhassan, Smart, and
Trick (42), and a thorough mathematical analysis and further
extension were given by Jeltsch and Pohl (43). For heterogeneous
circuits overlap scheduling is especially applicable to
the situation in which the coupling is very strong such that
some very large subcircuits would result. In this situation,
the additional cost of the overlap is offset by other gains in
compute time.

Another important method with the potential to improve
the overall efficiency of WR is ε scheduling by Odeh, Ruehli,
and Carlin (44). To show how the method works we consider
the coupling in a matrix system rather than a system of
differential equations. A typical form for the systems is

$$(L + E)x = b \qquad (34)$$

where L is a lower triangular invertible matrix representing
strong forward coupling and E is a matrix with a sparse array
of small coupling terms of O(ε) and zeros in all other locations.
For simplicity we assume that the small couplings in E can
be arranged in a vector, which we again call
$E = (v_1, v_2, v_3, \ldots)$ to retain meaning. For a given m we
divide the vector into two parts, $E = E_1 + E_2$, where
$E_1 = (v_1, v_2, v_3, \ldots, v_{m-1})$ and where $E_2$ is the remainder
of the vector. We need to note that the feedback element $v_j$
corresponds to the variable $x_j$. With this we can see that if
we ignore the feedback from the variables $x_j$, j > m, one
simply has to set all elements in $E_2$ to zero. We denote by y
the variables of the truncated system corresponding to
Eq. (34); then the new system is

$$(L + E_1)y = b \qquad (35)$$

We now can define the error vector e = x - y, which is due to
the truncation of the O(ε) feedback variables. The following
will give an indication of how the errors propagate in both the
forward and backward directions.

Theorem 5. For a system of size N, the components of the
error e for the truncated system, Eq. (35), in comparison with
the fully coupled system, Eq. (34), are given by
$e_k = O(\epsilon^{m-k})$ for the backward direction 0 < k < m - 1.
The error in the forward direction is given by $e_k = O(\epsilon)$
for k > m.

The proof of this theorem is given in Ref. 14. It gives us a
clear indication of how the scheduling can be changed to
improve global convergence. One key observation is that the
errors decay rapidly in the backward direction due to the
$\epsilon^{m-k}$ decoupling. This implies that a very good result can
be obtained for the present SCkt even if we have not analyzed
all the SCkts with a higher order. On the other hand, once an
error has been committed somewhere along the chain it will
propagate forward to all SCkts with a higher order until the
error has been corrected with further iterations.
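The error behavior stated in the theorem can be checked numerically on a small example. The sketch below is an assumption-laden illustration rather than the construction used in Ref. 14: the forward coupling is a fixed subdiagonal of 0.5, the feedback is a superdiagonal of size `eps`, and the truncated system drops the feedback from every variable with 1-based index j >= m. The helper names `solve` and `truncation_error` are hypothetical.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination without pivoting
    (adequate here because the example matrices are diagonally dominant)."""
    n = len(b)
    a = [row[:] for row in a]
    b = b[:]
    for i in range(n):
        for r in range(i + 1, n):
            f = a[r][i] / a[i][i]
            for c in range(i, n):
                a[r][c] -= f * a[i][c]
            b[r] -= f * b[i]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(a[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (b[i] - s) / a[i][i]
    return x


def truncation_error(n=8, m=5, eps=1e-2):
    """Return |x - y| per component, where y ignores the feedback
    from all variables with 1-based index j >= m."""
    full = [[0.0] * n for _ in range(n)]
    for i in range(n):
        full[i][i] = 1.0
        if i > 0:
            full[i][i - 1] = 0.5   # strong forward coupling (part of L)
        if i + 1 < n:
            full[i][i + 1] = eps   # weak feedback from the next variable (part of E)
    trunc = [row[:] for row in full]
    for i in range(n - 1):
        if i + 2 >= m:             # column i+1 holds variable j = i+2 (1-based)
            trunc[i][i + 1] = 0.0
    b = [1.0] * n
    x, y = solve(full, b), solve(trunc, b)
    return [abs(xi - yi) for xi, yi in zip(x, y)]


if __name__ == "__main__":
    for k, e in enumerate(truncation_error(), start=1):
        print(k, e)
```

In this example each step in the backward direction away from the cut reduces the error by roughly a factor of `eps`, while the components at and beyond the cut all stay of order `eps`.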

We can utilize this result to come up with a scheduling
that we call ε scheduling, for now-obvious reasons. The ε
schedule is applied locally and it "propagates" forward, making
sure that convergence is achieved locally after all the WR
iterations have been executed. For the inverter chain example
an ε schedule is given by

1 2 
1 2 3 
2 3 4 
3 4 5 
4 5 
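The schedule listed above has the shape of a sliding window that moves along the ordered chain. The short sketch below merely regenerates that pattern; it is not the scheduling algorithm of Ref. 44, and the function name `epsilon_schedule` and its `reach` parameter are hypothetical.

```python
def epsilon_schedule(num_sckts, reach=1):
    """Build a sliding-window schedule over an ordered chain of SCkts.

    Row c lists the SCkts analyzed together: SCkt c and its neighbors
    within `reach` positions, clipped at both ends of the chain.
    """
    rows = []
    for center in range(1, num_sckts + 1):
        first = max(1, center - reach)
        last = min(num_sckts, center + reach)
        rows.append(list(range(first, last + 1)))
    return rows


if __name__ == "__main__":
    for row in epsilon_schedule(5):
        print(" ".join(str(s) for s in row))
```

For five SCkts and `reach=1` the rows are 1 2, 1 2 3, 2 3 4, 3 4 5, and 4 5, which matches the example schedule.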

Techniques such as overlap scheduling and ε scheduling
clearly are more difficult to apply for complex circuits with
complicated fanout situations due to the complicated partitioned
circuits and logical signal flow. We give results for an
inverter chain, which is the simplest circuit with which to
illustrate these concepts. We consider chains with 16, 32, and
64 inverters that are partitioned into SCkts where each SCkt
has two inverters. This partition was found to give the best
results in all cases. Table 1 compares a conventional circuit
analysis result with a basic and ε-schedule WR analysis.
These results give insight into the general behavior of the
solution gain for WR over conventional circuit simulation.
First we observe that very small circuits have little multirate
and matrix overhead, so one would expect the conventional
SPICE type solution to be faster than WR, which is indeed
the case. The other interesting comparison is between different
scheduling algorithms for which the difference is a
nonmonotonic function. We did not try to apply overlap scheduling
to this since the coupling for the inverter chain is
moderate. It would not be a fair test for overlap scheduling,
which excels in strongly coupled situations, with a weakly
coupled example. Finally, we would like to point out that the
scheduling techniques can be combined. For example, overlap
and ε scheduling can be used for different parts of the same
circuit.

Table 1. Conventional Circuit Analysis vs. WR Analysis

                          Time (s)
Ckt Name   No. Trans.     Conventional   Basic Sched. WR   ε-sched. WR
Ch8        16             26             41                34
Ch16       32             118            100               48
Ch32       64             635            270               171

Global Convergence 

With the techniques described previously we are ready to
consider the difficult global convergence issues. The local
convergence between two SCkts has been examined extensively
earlier in the section entitled "Internal WR Algorithms." Global
convergence deals with the interaction of thousands of SCkts
with different waveform interfaces. The gain or loss of
efficiency due to the global algorithms can be considerable. A
single local waveform interface with poor partitioning can
slow down the convergence of a large circuit. The well-known
example is the one-way inverter chain for basic scheduling,
where the local feedback ε = 0, with the exception of the
mth stage, which has a feedback of ε. This may be SCkt m =
3 in the example of Fig. 7. In this case, the best number of
WR iterations is given by one iteration up to circuit m - 1 =
2. Then, the local iteration between SCkts 2 and 3 should be
iterated to convergence. Finally, all the circuits following
SCkt 3 again require only one iteration. It is evident that a
brute-force global analysis using a basic overall schedule
would be much more costly than an analysis with the best
possible global schedule outlined here.

The general situation for a subcircuit may be very chal- 
lenging due to the potentially complex interconnections. A 
SCkt m with its variables x m corresponding to Eq. (1) is repre- 
sented with all the potentially connected variables corre- 
sponding to other subsystems as 

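$$\sum_{j=1}^{m} c_{mj}\bigl(x_1^{(w+1)}, \ldots, x_m^{(w+1)}, x_{m+1}^{(w)}, \ldots, x_M^{(w)}\bigr)\,\dot{x}_j^{(w+1)}
+ \sum_{j=m+1}^{M} c_{mj}\bigl(x_1^{(w+1)}, \ldots, x_m^{(w+1)}, x_{m+1}^{(w)}, \ldots, x_M^{(w)}\bigr)\,\dot{x}_j^{(w)}
- f_m\bigl(x_1^{(w+1)}, \ldots, x_m^{(w+1)}, x_{m+1}^{(w)}, \ldots, x_M^{(w)}\bigr) = 0 \qquad (36)$$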

where the system is of size M. The updating in this equation
is of the GS type. It is clear
that the challenge is to partition a large circuit in such a way
that similar local convergence factors result for all the
variables involved. A relatively uniform basic schedule can then
be used without large inefficiencies. This is complicated by
the presence of feedback loops. Feedback loops have been
investigated by several researchers like Juan and Gear (45) and
Johnson and Ruehli (46). The work in Ref. 42 uses a theoretical
model, while the work in Ref. 43 is based on numerical
experiments. Experimental evidence shows that tight feedback
situations, which do not include many subcircuits inside
a feedback loop, as they exist in flip-flop circuits, lead to a
much larger number of WR iterations than loops that involve
more SCkts. This is due to the instantaneous and highly
nonlinear interactions of the SCkts in tight loops. More details on
this issue will be given later in the section entitled "Parallel
Waveform-Relaxation-Based Circuit Simulation."



WINDOWING AND OTHER EFFICIENCY IMPROVEMENTS

Windowing

It was shown in the section entitled "Fundamental WR
Techniques" that convergence is a function of the analysis time
interval. Specifically, the local convergence can be accelerated
by subdividing the total analysis time into a series of sequential
unequal time windows. All subcircuits need to be solved
to convergence within a time window before moving on to the
next window. The time window needs to be as large as possible
to allow the SCkts to operate with independent time steps
such that the multirate factor is maximized. In contrast to
this it has been observed in Theorem 4 that the larger the
nonlinearities or, equivalently, the Lipschitz constants K and
L, the smaller we must choose the window sizes T. Fortunately,
the time step may also be several orders of magnitude
smaller during the high-gain nonlinear transition where K
and L are large such that the number of time steps per window
is not drastically decreased during the highly nonlinear
transitions. Hence, we can still expect to obtain a reasonable
multirate factor.

Time-windowing algorithms have been suggested by several
authors [e.g., (14,19,47)]. Peterson and Mattisson (47)
suggest a time-windowing scheme that initially creates windows
whenever an input waveform changes state. Then as the
analysis proceeds, windows may be truncated based on the
convergence rate of the subcircuits and the number of
accumulated time points. By limiting the number of time points
within a window, memory requirements can easily be managed
and controlled.

In general, it is very hard to come up with heuristic windowing
algorithms for heterogeneous circuits. The best window
size is not only determined by the local convergence rate
but also by feedback loops such as a [...] or a ring
oscillator loop. Hence the dynamics of the local situation plays
a role in the choice of the window size, as we will illustrate.
We use the same complementary MOS
(CMOS) inverter chain in Fig. 7 for the windowing examples
as we did for the partitioning and scheduling. We note from
the data in Table 2 that the best results are only weakly
dependent on window size. The dependence is much stronger for
circuits with a complex fanout structure and for strong
feedback situations such as a ring oscillator. In this case, the
period of the oscillator determines the best window size, as is
shown by Urahama and Kawane (33).

Table 2. Effects of Window Size

Ckt Name   No. Trans.   Analysis Time (ns)   Best Window (ns)
Ch8        16           4.5                  4.6
Ch16       32           8                    2
Ch32       64           16                   4



Latency 

Efficiency improvements in the WR method have been pur- 
sued almost since its beginning. Waveform convergence may 
be measured by different weighted norms based onf^or on 
the g-fc norm, which may lead to a more sensitive criterion. 
This issue was first reported by Debefve, Hsieh, and Ruehli 
(48). 

Some of the additional convergence testing concepts lead
to considerable reduction in compute time. For a large circuit
there usually exist some subcircuits that do not need to be
analyzed, because their surrounding subcircuits do not
change over a particular window in time T. This situation is
stated in the next paragraph in some detail.

For a given subcircuit SCkt, we call all the associated
waveforms x(t). They include the external waveforms $x_E(t)$ and
the internal waveforms $x_I(t)$, corresponding to nodal voltages
or currents external or internal to the given subcircuit SCkt,
respectively.

Definition 4. A SCkt is said to be latent if

1. The SCkt has been analyzed at least once for the present
time window T.

2. All external waveforms $x_E(t)$ associated with the SCkt
do not change between iterations (w) and (w + 1) in the
present time window T. This change is measured by
comparing

$$|x_E^{(w)}(t) - x_E^{(w-1)}(t)| < \epsilon_A + \epsilon_R \max \|x_E^{(w)}\|_\infty \qquad (37)$$

where $\epsilon_A$ is the absolute waveform error and $\epsilon_R$ is the
relative waveform error.

Then the subcircuit SCkt is declared latent and is not analyzed
until either the inputs $x_E(t)$ change or the analysis
moves on to a new time window. Essentially, latency is the
limiting form of partial waveform convergence considered in
the next section. The application of latency can lead to an
appreciable improvement in overall solution efficiency. For
example, the solution of a 4-bit ALU with 282 FET transistors
analyzed on a small IBM RS/6000 workstation required 249
central processing unit (CPU) seconds without the above latency
algorithm invoked, as compared with 101 CPU seconds
with the latency algorithm used.
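The latency test can be written compactly for sampled waveforms. The sketch below is a minimal illustration, assuming that the waveforms of two successive WR iterations are stored as lists on a common time grid; the function names and the default tolerances are hypothetical and not taken from any particular WR program.

```python
def waveform_unchanged(curr, prev, eps_abs=1e-4, eps_rel=1e-3):
    """True if two sampled waveforms agree within
    eps_abs + eps_rel * max|curr| at every time point."""
    bound = eps_abs + eps_rel * max(abs(v) for v in curr)
    return all(abs(c - p) < bound for c, p in zip(curr, prev))


def is_latent(analyzed_once, external_curr, external_prev, **tol):
    """Latency test for one SCkt in the present time window.

    external_curr / external_prev: dict name -> sampled external waveform
    for the current and the previous WR iteration.
    """
    if not analyzed_once:
        return False  # condition 1: at least one analysis in this window
    return all(       # condition 2: no external waveform has changed
        waveform_unchanged(external_curr[name], external_prev[name], **tol)
        for name in external_curr
    )


if __name__ == "__main__":
    same = {"in": [0.0, 1.0, 1.0]}
    moved = {"in": [0.0, 0.5, 1.0]}
    print(is_latent(True, same, same), is_latent(True, same, moved))
```

A scheduler would skip the analysis of every SCkt for which `is_latent` returns true and re-check it whenever one of its inputs is updated.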



Partial Waveform Convergence 

This algorithm represents a more elaborate form of latency.
It was recognized that many waveforms were rejected toward
the end of the time window T due to the nonuniform convergence
of the WR process. This nonuniform convergence was
considered earlier. The partial waveform convergence is given
by the following algorithm.

Definition 5. A SCkt is said to be partially converged or
partially latent if

1. The SCkt has been analyzed at least once for the present
time window T.

2. All waveforms $x_E(t)$ associated with the SCkt do not
change up to the time point $\hat{t}$ for (w) and (w + 1). This
change is measured as

$$|x_E^{(w)}(t) - x_E^{(w-1)}(t)| < \epsilon_A + \epsilon_R \max \|x_E^{(w)}\|_\infty \qquad (38)$$

where $\epsilon_A$ is the absolute waveform error and $\epsilon_R$ is the
relative waveform error.

Then the subcircuit SCkt does not need to be re-solved over
the entire interval [0, T] for iteration (w + 1), but only over
the shorter interval [$\hat{t}$, T]. The application of partial waveform
convergence can lead to an appreciable improvement in overall
efficiency. For example, the solution of a clock-signal-generation
circuit containing 1059 FETs run again on a small
IBM RS/6000 workstation required 2361 CPU seconds when
partial waveform convergence was not used, versus only 2430
CPU seconds using the partial waveform convergence just
mentioned.
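The point up to which a waveform has converged can be located with a single pass over the samples. This is a sketch under the same assumptions as a sampled-waveform latency test (common time grid, hypothetical default tolerances); the function name `converged_prefix_time` is an assumption.

```python
def converged_prefix_time(times, curr, prev, eps_abs=1e-4, eps_rel=1e-3):
    """Return the last time point up to which curr and prev agree within
    eps_abs + eps_rel * max|curr|, or None if they already differ at the
    first time point."""
    bound = eps_abs + eps_rel * max(abs(v) for v in curr)
    t_hat = None
    for t, c, p in zip(times, curr, prev):
        if abs(c - p) >= bound:
            break  # first disagreement: the converged prefix ends here
        t_hat = t
    return t_hat


if __name__ == "__main__":
    times = [0.0, 1.0, 2.0, 3.0]
    print(converged_prefix_time(times, [0.0, 1.0, 2.0, 3.0],
                                [0.0, 1.0, 2.5, 3.0]))
```

The next WR iteration would then restart the transient analysis of the SCkt at the returned time point instead of at the beginning of the window.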

Coupled and Preconditioned WR 

The WR approach has the potential to be used in many different
ways due to its iterative basis. Here, we consider two
different important aspects of how a WR circuit solver can
interact with other circuit solvers. Several circuit simulators
must cooperate in a multilevel simulation environment.
A higher-level simulator may have to be coupled to a
WR circuit solver. A multirate waveform interface (49,50) is
a very good way to couple tools together by exchanging
waveforms during each time window. However, the coupled
waveforms may have to be subjected to some processing such that
the waveforms fulfill the appropriate smoothness conditions.
The WR solver will supply the appropriate master time
windows.

Another approach has been proposed by Burrage (15) in
which the waveforms are preconditioned with some other
waveforms. Very good waveforms may be obtained from a
faster, more approximate circuit simulator. We did some
informal studies of the preconditioning process by distorting the
solution waveforms obtained from a WR solver. We discovered
two different regimes. Very rapid convergence to the exact
waveforms was observed, provided that the distortion was not
too large. For the case in which the distortion was large, the
starting waveforms seem to have little impact on the convergence
behavior. It should be noted that many other situations
are relevant. For example, in a hierarchical situation as is
shown in Fig. 1 only a few waveforms need to be known at an
interface between the functional units to enable the analysis
of other functional units using tmatin g waveforms for WR 



PARALLEL WAVEFORM-RELAXATION-BASED 
CIRCUIT SIMULATION 

Parallel implementations of WR have been investigated by 
many researchers (47,51-57) since the approach is ideally 
suited for parallelization. Many of the techniques developed 
for parallel WR are detailed in the book by Banerjee (58). Be- 
cause each subcircuit is solved independently, subcircuits can 
be distributed among multiple processors and solved concur- 
rently. During every iteration, each processor must have ac- 
cess to the input waveforms for each subcircuit that it is to 
solve. Once waveforms are available, a processor can then 
solve a subcircuit over a time window T. Only after a subcir- 
cuit has been solved is there a need to share data among pro- 
cessors. This results in infrequent sharing of relatively large 
blocks of data among processors. Generally the time to solve 
each subcircuit is relatively long compared with the time 
needed to communicate results among processors. This implies
that the ratio of computation time to communication time
will be high, and good parallel speedups are possible.
Moreover, as circuit size increases, the size of each
subcircuit often remains relatively constant, while the number
of subcircuits generally increases. Therefore, as circuit size
increases, the opportunities for parallelism also increase.
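
The distribution of independent subcircuit solves can be
sketched as follows. This is a minimal illustration in Python,
not the implementation of any of the cited simulators. The
function names (solve_subcircuit, gj_iteration, parallel_wr),
the two-node linear test circuit, and the forward-Euler
integration are all assumptions chosen only to keep the example
self-contained. Threads are used for brevity; a real
implementation would use separate processes or machines.

```python
from concurrent.futures import ThreadPoolExecutor


def solve_subcircuit(index, waveforms, coupling, dt):
    """Integrate one single-node subcircuit over the whole time window.

    Hypothetical model: dv_i/dt = -v_i + coupling * v_j + 1, where the
    neighbor waveform v_j is the one computed in the previous WR iteration.
    """
    neighbor = waveforms[1 - index]
    v = [0.0]                          # initial condition v_i(0) = 0
    for k in range(len(neighbor) - 1):
        dv = -v[k] + coupling * neighbor[k] + 1.0
        v.append(v[k] + dt * dv)       # forward-Euler step
    return v


def gj_iteration(waveforms, coupling, dt, executor):
    """One WR iteration: all subcircuits are solved concurrently."""
    futures = [executor.submit(solve_subcircuit, i, waveforms, coupling, dt)
               for i in range(len(waveforms))]
    # Data are shared only after each subcircuit has finished its window.
    return [f.result() for f in futures]


def parallel_wr(n_steps=200, dt=0.01, coupling=0.5, tol=1e-9, max_iter=50):
    """Iterate until successive waveforms differ by less than tol."""
    waveforms = [[0.0] * (n_steps + 1) for _ in range(2)]  # initial guess
    with ThreadPoolExecutor(max_workers=2) as executor:
        for iteration in range(1, max_iter + 1):
            new = gj_iteration(waveforms, coupling, dt, executor)
            change = max(abs(a - b)
                         for old_w, new_w in zip(waveforms, new)
                         for a, b in zip(old_w, new_w))
            waveforms = new
            if change < tol:
                return waveforms, iteration
    return waveforms, max_iter


if __name__ == "__main__":
    waveforms, iterations = parallel_wr()
    print(iterations, waveforms[0][-1])
```

Each call to solve_subcircuit touches only its own output
waveform and reads a complete input waveform, so the only
exchange between workers is one waveform per subcircuit per
iteration.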

Architecture Considerations 

Parallel-processing machines can be grouped into two classes: 
single-instruction, multiple-data (SIMD) and multiple-in- 
struction, multiple-data (MIMD). In a SIMD machine, each 
processor executes the same instructions on different data 
streams. In a MIMD machine, each processor executes differ- 
ent instructions on different data streams. Parallel WR solves 
different subcircuits on each processor, and therefore each 
processor will in general be executing different instructions 
on different data, which implies that parallel WR is best 
suited for a MIMD architecture. Additionally, both SIMD and 
MIMD machines can be implemented using either shared or 
distributed memory. In a shared-memory machine, each pro- 
cessor is capable of accessing all memory in the machine. It 
is usually the programmer's responsibility to make sure no 
two processors attempt to access the same memory locations 
simultaneously. The Cray C-90™ and SGI IRIS Challenger™
are examples of shared-memory MIMD machines. In a
distributed-memory machine, each processor has its own local
memory, which cannot be accessed by other processors. Shar- 
ing of data is accomplished through message passing between 
processors. One form of distributed memory machine is a net- 
work of workstations using MPI to share data over a network. 
The IBM SP2™, Intel Paragon™, and Cray T3D™ are examples
of more closely coupled distributed-memory MIMD machines.
One advantage of distributed-memory machines is
that no single processor needs to have enough memory to hold 
all of the data for analysis. This becomes increasingly impor- 
tant as circuit sizes increase. On the other hand, shared mem- 
ory permits faster exchange of data among processors. 

As stated above, a MIMD architecture is well suited for 
WR where parallelism is applied at the subcircuit level with 
each processor solving its own set of subcircuits. Either 
shared or distributed memory can be used, each with its own 
advantages and disadvantages. In a shared-memory environment,
it is easier to balance work load among the processors,
because each processor has complete access to all data
relative to the analysis. As each processor completes an analysis
of a subcircuit, it solves the next subcircuit that is ready to 
be processed (61). In this way, slower processors will automat- 
ically take on less work, while faster processors will do more. 
One associated disadvantage is that a relatively complicated 
locking mechanism must be implemented to prohibit different 
processors from trying to read and write the same data at the 
same time. Another is that all input data and computed re- 
sults must fit within the globally shared memory. 

Distributed memory eliminates problems relating to simul- 
taneous access of data and the need to have all data fit within 
one global memory. However, because all data are not easily 
accessible to all processors, it is harder to balance work load. 
Most implementations statically assign subcircuits to pro- 
cessors at the beginning of an analysis using a combination 
of heuristics to attempt to predict and balance work load and 
communication patterns (56). Dynamic work load balancing 
(59) requires the transfer of subcircuits and their "state" from 
one processor to another, which may be several thousands of 
bytes. If these transfers cannot be done quickly or they must 
be done often, it may be faster to stay with a suboptimal
subcircuit-to-processor assignment. In addition, performance may
be affected by the time required to share data among pro- 
cessors. Fortunately, windowed WR at the subcircuit level re- 
quires infrequent sharing of data among processors. Never- 
theless, the time to communicate results may be a significant 
portion of total job time. Consequently, most MIMD imple- 
mentations attempt to minimize communication by assigning 
subcircuits that share data to the same processor and to 
"hide" communication overhead by overlapping communica- 
tion and computation, that is, by continuing to compute addi- 
tional results while communication is progressing. The under- 
lying assumption is that parallel WR is applied to very large 
circuits that partition into many subcircuits, and that there 
are many more subcircuits than processors. Therefore, each 
processor will generally have sufficient work to remain active 
while data are being shared among processors. 
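
A static assignment of the kind described above can be sketched
with a simple greedy rule. The heuristics used in Ref. 56 are
not reproduced here. The sketch below assumes a hypothetical
per-subcircuit cost estimate and uses the well-known
longest-processing-time rule, ignoring communication patterns
entirely. The function name assign_subcircuits and the cost
values are illustrative only.

```python
import heapq


def assign_subcircuits(costs, n_processors):
    """Greedy static assignment of subcircuits to processors.

    costs: dict mapping subcircuit name -> estimated solve cost
           (hypothetical estimates, e.g. from device counts).
    Subcircuits are taken in order of decreasing cost and each one is
    given to the processor with the smallest load so far.
    """
    heap = [(0.0, p) for p in range(n_processors)]   # (load, processor id)
    heapq.heapify(heap)
    assignment = {p: [] for p in range(n_processors)}
    for name, cost in sorted(costs.items(), key=lambda kv: (-kv[1], kv[0])):
        load, p = heapq.heappop(heap)                # least-loaded processor
        assignment[p].append(name)
        heapq.heappush(heap, (load + cost, p))
    loads = {p: sum(costs[n] for n in names)
             for p, names in assignment.items()}
    return assignment, loads


if __name__ == "__main__":
    example = {"s1": 8.0, "s2": 7.0, "s3": 6.0,
               "s4": 5.0, "s5": 4.0, "s6": 2.0}
    assignment, loads = assign_subcircuits(example, 2)
    print(assignment, loads)
```

A practical assignment would also weigh which subcircuits
exchange waveforms, so that communicating subcircuits tend to
land on the same processor.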



Algorithm Selection

It was shown earlier that the GS relaxation algorithm will, in
general, converge in fewer iterations than the GJ algorithm,
and is usually the favored implementation for sequential
processing. However, the faster convergence rate of the GS
algorithm is derived from an ordering and scheduling of
subcircuits that limits parallelism. Parallelism is limited by
the number of subcircuits that can be scheduled at each Seidel
level. Circuits that partition into long chains of subcircuits
with little fanout will have little parallelism to exploit,
whereas circuits like the DRAM error correction circuit shown
in Fig. 2 offer a great deal of potential parallelism. In
contrast, parallelism using the GJ algorithm is limited only by
the number of subcircuits. With the GJ algorithm, during
waveform iteration (w + 1) all subcircuits are solved using
input waveforms computed during iteration (w). Hence no
ordering of subcircuits is necessary. This implies that once all
subcircuits have been solved for an iteration, all data are
available to schedule all subcircuits for the next iteration.
Consequently, the GJ algorithm has the potential for
parallelism that is equal to and increases linearly with the
number of subcircuits.
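
The parallelism available at each Seidel level can be counted
directly from the subcircuit dependency graph. The following
sketch assumes that the graph is acyclic (feedback loops already
cut or merged) and that a subcircuit's level is one more than the
highest level among the subcircuits that drive it. The function
names and the two small example graphs are hypothetical.

```python
from collections import defaultdict, deque


def seidel_levels(n, edges):
    """Return the level (1-based) of each of n subcircuits.

    edges: (driver, receiver) pairs of an acyclic dependency graph.
    """
    fanout = defaultdict(list)
    indegree = [0] * n
    for driver, receiver in edges:
        fanout[driver].append(receiver)
        indegree[receiver] += 1
    level = [1] * n
    ready = deque(i for i in range(n) if indegree[i] == 0)
    while ready:
        node = ready.popleft()         # all drivers of node are final
        for nxt in fanout[node]:
            level[nxt] = max(level[nxt], level[node] + 1)
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)
    return level


def level_widths(level):
    """Number of subcircuits that can be solved concurrently per level."""
    widths = [0] * max(level)
    for lv in level:
        widths[lv - 1] += 1
    return widths


if __name__ == "__main__":
    chain = [(i, i + 1) for i in range(5)]             # 0 -> 1 -> ... -> 5
    tree = [(0, 1), (0, 2), (1, 3), (1, 4), (2, 5)]    # some fanout
    print(level_widths(seidel_levels(6, chain)))
    print(level_widths(seidel_levels(6, tree)))
```

For the chain, every level holds a single subcircuit, so GS
offers no parallelism, whereas GJ could solve all six
subcircuits at once in either graph.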




Although the parallel GS method will retain a faster
convergence rate than the GJ method (fewer iterations), the
limits on available parallelism mean that the time to complete
those iterations may actually be longer than the time to
complete the GJ iterations. If the number of available processors
is large, the GJ algorithm will in general be able to use all of 
them. The GS algorithm, on the other hand, will only be able 
to use effectively a number of processors equal to the maxi- 
mum number of subcircuits scheduled at any Seidel level. 
Therefore, the GS algorithm is not necessarily the best algo- 
rithm for parallel processing. However, if the number of pro- 
cessors is smaller than the average number of subcircuits at 
each Seidel level, then the GS method is probably the better 
choice. In such cases parallelism will be limited by the num- 
ber of processors, and the faster convergence rate of the GS 
algorithm will result in a faster solution. In most applications 
the number of processors is limited, whereas the number of 
subcircuits and their relationship to one another is circuit de- 
pendent. The best implementation would be to include both 
algorithms with automatic selection of the GS or GJ algo- 
rithms based upon circuit topology and the number of pro- 
cessors available to solve the problem. 
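
An automatic selection of this kind could be based on a rough
cost model. The sketch below is only one possible model and is
not taken from the cited work. It assumes that every subcircuit
takes one unit of time to solve, that communication is free,
and that the iteration counts for GS and GJ are supplied by the
caller (hypothetical values in the example). GS processes the
Seidel levels one after another, whereas GJ treats all
subcircuits as a single pool.

```python
from math import ceil


def sweep_time_gs(widths, n_proc):
    """Time units for one GS iteration: levels run sequentially."""
    return sum(ceil(w / n_proc) for w in widths)


def sweep_time_gj(widths, n_proc):
    """Time units for one GJ iteration: all subcircuits are independent."""
    return ceil(sum(widths) / n_proc)


def choose_algorithm(widths, n_proc, iters_gs, iters_gj):
    """Pick the algorithm with the smaller estimated total time."""
    t_gs = iters_gs * sweep_time_gs(widths, n_proc)
    t_gj = iters_gj * sweep_time_gj(widths, n_proc)
    return ("GS", t_gs) if t_gs <= t_gj else ("GJ", t_gj)


if __name__ == "__main__":
    widths = [4, 4, 4, 4]      # subcircuits per Seidel level (assumed)
    for n_proc in (2, 16):
        print(n_proc, choose_algorithm(widths, n_proc, iters_gs=5, iters_gj=8))
```

With two processors, fewer than the four subcircuits per level,
the model favors GS. With sixteen processors it favors GJ, in
line with the qualitative argument above.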

Another implementation consideration is memory usage. 
In order to determine convergence, at any iteration (w + 1)
both GS and GJ algorithms require storage to hold computed 
waveforms for iterations (w) and (w + 1). For each subcircuit, 
complete waveforms must be retained for all computed wave- 
forms for two iterations. For a single processor, this implies 
that all waveforms must be stored twice. However, on a multi- 
processor system, each processor only needs to store iteration 
(w) and (w + 1) values for those waveforms that are actually 
computed on that processor, along with waveforms for either 
the (w) or (w + 1) iteration of inputs solved on other proces- 
sors. Input waveforms are needed for iteration (w) when using 
the GJ algorithm and for iteration (w + 1) when using the 
GS algorithm. With the GS algorithm, newly computed wave- 
forms can be shared with other processors immediately. How- 
ever, unless each processor maintains storage for inputs for 
both iterations (w + 1) and (w), the GJ algorithm must delay 
sharing newly computed waveforms among processors until 
all processors have completed each waveform iteration. Oth- 
erwise, data for iteration (w + 1) may overwrite data expected 
to be for iteration (w). Consequently, the parallel GS method 
can be implemented to use less storage per processor than the 
GJ method. The alternative is to defer sharing of data until
all processors have completed an iteration. This can result in 
communication bottlenecks and substantially reduce perfor- 
mance, especially for distributed-memory machines. 
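
The difference in which iteration the inputs come from can be
illustrated with scalars standing in for waveforms. This is a
deliberately simplified sketch. The chain-coupled update rule
and all names are assumptions, and a real WR solver would
exchange whole waveforms rather than single numbers. The GJ
sweep reads only iteration (w) values and therefore must keep
them intact until the sweep is complete, whereas the GS sweep
overwrites each value as soon as it is computed.

```python
def update(i, x):
    """Hypothetical chain coupling: node i is driven by node i - 1."""
    return 1.0 if i == 0 else 0.5 * x[i - 1] + 1.0


def gj_sweep(x):
    """Gauss-Jacobi: every update reads the iteration (w) values in x."""
    return [update(i, x) for i in range(len(x))]


def gs_sweep(x):
    """Gauss-Seidel: each new value replaces the old one immediately."""
    for i in range(len(x)):
        x[i] = update(i, x)
    return x


def iterate(sweep, n, tol=1e-12, max_sweeps=100):
    """Repeat sweeps until no value changes by more than tol."""
    x = [0.0] * n
    for count in range(1, max_sweeps + 1):
        old = list(x)          # copy kept only for the convergence test
        x = sweep(x)
        if max(abs(a - b) for a, b in zip(old, x)) < tol:
            return x, count
    return x, max_sweeps


if __name__ == "__main__":
    print(iterate(gj_sweep, 4))
    print(iterate(gs_sweep, 4))
```

On this four-node chain the GS ordering reaches the solution in
one sweep plus one sweep to confirm convergence, while GJ needs
one sweep per node plus the confirming sweep.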

With the GS algorithm, data must be shared among pro- 
cessors throughout the analysis of a time window in order for 
the solution to proceed. If input waveforms are not available 
to solve a subcircuit, a processor may have to wait for data to 
be computed on another processor. So not only does the GS 
algorithm limit parallelism, it also may introduce bottlenecks 
and adversely affect load balance among processors. In an at- 
tempt to reduce these effects, Zukowski and Johnson (60) 
have reported implementation of a "mixed" Seidel-Jacobi or
bounded-chaotic algorithm that attempts to solve all subcir- 
cuits using the GS algorithm. However, if a processor is idled 
due to lack of input waveforms for the current iteration, a 
subcircuit is chosen to be solved using whatever waveforms 
are available. Some inputs may be from the current iteration,
while others may be from the previous iteration. The algorithm is
bounded in that waveforms can be at most one iteration be- 
hind the current iteration, like the Jacobi algorithm. The 
hope is that a solution will be completed faster if processors 
remain busy, even if all of the input waveforms do not meet 
strict Seidel ordering. For circuits with large fanouts that per- 
mit the effective use of a large number of processors, this im- 
plementation retains the faster convergence rate of the GS 
algorithm. For circuits with less fanout, this technique should 
take no longer than the GJ algorithm in which all input wave- 
forms are from the previous iteration. 

Implementations 

Parallel WR may be implemented using either a master-slave
or a data-driven paradigm. In a master-slave implementation,
one processor serves work to the others and synchronizes
each iteration of the analysis. The master-slave
setup is well suited for a shared-memory machine, because 
all data are available to all processors, and therefore the mas- 
ter can quickly assign any work item to any processor without 
the need to transmit large quantities of data. In addition, the 
master can maintain data integrity by only permitting one 
processor to access a specific data item at a time. 

In a data-driven implementation, each processor solves its 
assigned subcircuits as soon as input waveforms are avail- 
able. Synchronization is required only to determine conver- 
gence, update window boundaries, and prepare output files. 
A data-driven implementation will function equally well on 
either shared- or distributed-memory machines. 

Efficiency Improvements 

With either implementation, whenever input data are avail- 
able to solve a subcircuit, the circuit can be scheduled for 
analysis. In general, there will be many more subcircuits than 
processors, and each processor will have more than one sub- 
circuit that can be solved at any time. Under such conditions, 
a choice must be made as to the order in which the subcircuits 
are solved. When using the GJ algorithm, the choice is
unimportant. However, when using the GS algorithm, this choice
may greatly affect overall performance and throughput. The 
subcircuits for which data are available should be sorted and 
solved in order based upon the level at which their outputs 
are needed. For example, consider the situation in which a 
processor has two subcircuits that can be solved. One has out- 
puts that are needed as inputs to another subcircuit at level 
4, while the other subcircuit's outputs are not needed until
level 5. The subcircuit whose outputs are needed at level 4
should be solved first. This will permit the outputs to be
communicated to other processors while the second subcircuit is
being solved. Ideally, the data will arrive before the
receiving processor finishes the subcircuits it is currently
solving, and that processor will not have to wait for data.
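
The ordering rule described above maps naturally onto a
priority queue. The sketch below is an assumption about how
such a queue might look, not a description of any cited
implementation. The class name ReadyQueue and its methods are
hypothetical. Subcircuits that become ready are pushed with the
Seidel level at which their outputs are first needed, and the
processor always pops the one needed soonest.

```python
import heapq


class ReadyQueue:
    """Ready subcircuits, ordered by the level that needs their outputs."""

    def __init__(self):
        self._heap = []

    def push(self, name, needed_at_level):
        # Ties at the same level are broken by subcircuit name.
        heapq.heappush(self._heap, (needed_at_level, name))

    def pop(self):
        needed_at_level, name = heapq.heappop(self._heap)
        return name

    def __len__(self):
        return len(self._heap)


if __name__ == "__main__":
    queue = ReadyQueue()
    queue.push("B", 5)         # outputs not needed until level 5
    queue.push("A", 4)         # outputs needed at level 4
    print(queue.pop())         # the level-4 subcircuit is solved first
```

Because new entries can be pushed at any time, a subcircuit
whose inputs arrive late but whose outputs are needed early
still moves to the front of the queue.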

In the previous section "Ordering and Scheduling," options 
were discussed for dealing with feedback loops. However, the 
choice of whether to break feedback loops into SCkts is differ- 
ent when using a multiprocessor system (46). One of the pri- 
mary goals of parallel WR is to keep all of the processors busy 
most of the time. Feedback loops that are merged into a single 
subcircuit maintain strict GS ordering, but they create larger
subcircuits. This has a negative impact on load balancing,
matrix size, and the multirate speedup. However, if feedback
loops are cut such that two (or more) similarly sized
subcircuits are created, these subcircuits can be distributed
among the processors. Since we expect cut feedback loops to
result in additional WR iterations, it is advantageous to
iterate multiple times during each WR iteration among
subcircuits resulting from cut feedback loops. Table 3 gives
timing results for 16 circuits ranging in size from under 300
to over 93,000 transistors when all feedback loops are cut
versus only cutting long loops where the feedback loop extends
over several subcircuits. These results were obtained using the
experimental Victor V256 processor described in Ref. 56, with
the larger circuits using all 256 processors.

Table 3. Timing Results

                         Time (s)
Ckt Name    All Loops Cut    Only Long Loops Cut
Ckt1                   28                    116
Ckt2                   33                    162
Ckt3                   46                    134
Ckt4                   90                    246
Ckt5                  103                    242
Ckt6                  106                    297
Ckt7                  113                    278
Ckt8                  142                    175
Ckt9                  155                    285
Ckt10                 159                    288
Ckt11                 195                    355
Ckt12                 226                    316
Ckt13                 229                    352
Ckt14                 537                    815
Ckt15                 477                   1111
Ckt16                1025                   1509
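
The relative benefit of cutting all loops can be read off
Table 3 by dividing the two columns. The short script below
does this arithmetic. The numbers are transcribed from Table 3
as printed here, and the names TABLE_3 and time_ratios are
introduced only for this illustration.

```python
# (all loops cut, only long loops cut), times in seconds from Table 3
TABLE_3 = {
    "Ckt1": (28, 116), "Ckt2": (33, 162), "Ckt3": (46, 134),
    "Ckt4": (90, 246), "Ckt5": (103, 242), "Ckt6": (106, 297),
    "Ckt7": (113, 278), "Ckt8": (142, 175), "Ckt9": (155, 285),
    "Ckt10": (159, 288), "Ckt11": (195, 355), "Ckt12": (226, 316),
    "Ckt13": (229, 352), "Ckt14": (537, 815), "Ckt15": (477, 1111),
    "Ckt16": (1025, 1509),
}


def time_ratios(table):
    """Ratio of 'only long loops cut' time to 'all loops cut' time."""
    return {name: long_only / all_cut
            for name, (all_cut, long_only) in table.items()}


if __name__ == "__main__":
    ratios = time_ratios(TABLE_3)
    for name, ratio in ratios.items():
        print(f"{name:6s} {ratio:5.2f}")
    best = max(ratios, key=ratios.get)
    worst = min(ratios, key=ratios.get)
    print("largest ratio:", best, "smallest ratio:", worst)
```

For every circuit in the table the ratio exceeds 1, that is,
cutting all loops gave the shorter run time on this machine.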

SUMMARY AND CONCLUSIONS 

We summarize the state of the waveform-relaxation tech- 
niques in this article. WR is a very active area of research as 
is evident from the publications listed here, which are only a 
fraction of all the work done in this area. Also, there are many 
more relevant works on WR that are of interest. To mention 
just a few topics of interest, there are the faster sensitivity 
computations by Chen and Feng (61) and the related error 
measuring technique by Gristede, Zukowski, and Ruehli (62). 
Other work of importance is hierarchical WR by Saviz and 
Wing (35). We hope that it is evident from this article that 
WR is an interesting area of research with potential for fur- 
ther innovations as well as applications. 

The WR approach shows a clear speed advantage for very 
large circuits over conventional circuit solvers. However, even 
today a fast workstation is required to run circuits that are 
large enough to show substantial gains. This may be of
interest to a large company or a university, but it is of
lesser interest to the average user of a circuit solver such as
the many SPICE-like tools that may run on a small machine. We
expect that the WR approach will become much more popular with
the next generation of high-performance workstations, which
include multiple processors at a more moderate price. As is
evident from this article, the gains in compute time will be
substantial. We expect that the availability of parallel
computing for a wider audience will make the WR algorithms of
more interest to EDA companies.



CIRCUIT ANALYSIS COMPUTING OF
SEMICONDUCTOR PACKAGES AND SYSTEMS

The electronic package has four major functions: (1) supply of
the electrical power necessary for operation of electronic
circuits, (2) distribution of electrical signals carrying
information, (3) removal of heat generated during operation of
circuits, and (4) mechanical support of circuits and protection
from the environment. Electrical package design is concerned
primarily with the first two functions. The typical goal of
electrical design is to obtain a package that supports the
required speed of system operation and maintains electrical
noise at or below tolerances. Package design is also influenced
by cost and reliability considerations. However, this article
deals with package electrical performance, which is related to
the first two listed package functions.

Design is based on mathematical modeling of the package, which
describes the relations between the performance of the designed
object and the design variables. Mathematical modeling of
electronic packaging is quite involved, and the resulting
models are complicated. Consequently, analytical approaches to
package design are so limited that computer support is
required. Simulation, which involves computer implementation of
package models and numerical imitation of package performance,
is commonly used as a design support.

ELECTRICAL PACKAGE MODELING 

Electrical package modeling (i.e., description of relations be- 
tween electrical performance and design variables) is based 
on a number of simplifying assumptions that depend on the 
type of signals considered and packaged device/system appli- 
cation. Therefore, packages can be classified into specific cate- 
gories on the basis of application and modeling features. Thus 
modeling of the following package categories will be discussed:
radio-frequency (RF) packages (i.e., packages housing RF cir- 
cuits), digital packages housing digital circuits, and mixed- 
signal packages containing circuits that perform operations 
on both analog and digital signals. 

RF Package Modeling 

Modeling of RF packages is closely related to modeling of 
packaged devices and circuits, and in most cases it is very 
complicated. Modeling of parasitic couplings between the 
components is a particularly troublesome problem. Typical 
models are developed in the frequency domain because the
relevant circuits and packages usually operate in narrow
frequency bands. These models are based on linear approximation
(linearization), which is valid for small-signal operation.
Specialized software tools must be used to support modeling. 
Such tools are being intensively developed because of growing 
demand for a variety of RF circuits needed in wireless com- 
munication. These tools are continually updated and modified 
in cooperation with the user community. In light of the rapid 
changes of simulation tools, they are not described here in 
detail. 

Some circuits such as RF power amplifiers operate with 
large signals such that the linear approximation cannot be 
used, and novel methods like harmonic balance are used in 
modeling instead (1). Both modeling methods and supporting 
software are under vigorous development. 



